Some explanation to some errors and warnings
INFO: task blocked for more than 120 seconds.

Under heavy IO load on servers you may see something like:

INFO: task nfsd:2252 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

...probably followed by a call trace that mentions your filesystem, and probably io_schedule and sync_buffer.


This message is not an error; it's telling you that this process has not been scheduled on the CPU at all for 120 seconds, because it was in uninterruptible sleep state. (The code behind this message sits in hung_task.c and was added somewhere around 2.6.30. It is a kernel thread that detects tasks that stay in the D state for a while.)


At the same time, 120 real-world seconds is an eternity for a CPU, for most programs, and for most users.

Not being scheduled for that long typically signals resource starvation, usually IO, often around disk access. That means you usually don't want to silence or ignore this message: you want to find out when and why it happened, and probably avoid it in the future.


The stack trace can help diagnose what the process was doing, which is not necessarily informative of the reason: the named program is often the victim of another one misbehaving, though it is sometimes the culprit.
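If you want to see which processes are currently stuck like that, something like the following can help (a sketch; exact ps columns and widths are up to taste):

# list processes in uninterruptible sleep (D state), plus the kernel function they are waiting in
ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'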


Reasons include:

  • the system is heavily swapping, possibly to the point of thrashing, due to memory allocation issues - this could be caused by any program
  • the underlying IO system is very slow for some reason - I've seen mentions of this happening in VMs that share disks
  • specific bugs (in kernel code, systemd) have caused this as a side effect






Notes:

  • if it happens constantly, your IO system is slower than your IO use
  • it can happen to a process that was ioniced into the idle class,
which means ionice is working as intended, because the idle class is meant as an extreme politeness thing. It just indicates something else has been doing a consistent bunch of IO for at least 120 seconds, and doesn't help find the actual cause.
A typical recipient is something like updatedb, if that is the thing you ioniced (see the sketch after this list)
  • if it happens only nightly, look at your cron jobs
  • a thrashing system can cause this, in which case it's purely a side effect of a program using more memory than there is RAM
  • it can be caused by a desktop-class drive with bad sectors blocking things (because such drives retry for a long while)


  • NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem. (verify)
  • if it happens on a fileserver, you may want to consider spreading to more fileservers, or using a parallel filesystem
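For that ionice note, a quick sketch of using and inspecting IO classes (the updatedb example and the PID are just illustrations):

# run something in the idle IO class, so it only gets disk time when nothing else wants it
ionice -c 3 updatedb
# check which IO class and priority an existing process has
ionice -p 1234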


If your load is fairly sequential, you may get some relief from using the noop IO scheduler (instead of cfq), though note that that disables ionice.
If your load is relatively random, upping the queue depth may help.
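A sketch of where those knobs live, assuming the disk in question is sda (available scheduler names depend on your kernel version):

# see which IO schedulers are available, and which is active (shown in [brackets])
cat /sys/block/sda/queue/scheduler
# switch to noop (as root)
echo noop > /sys/block/sda/queue/scheduler
# inspect / raise the device's queue depth (this path applies to SCSI-like devices)
cat /sys/block/sda/device/queue_depth
echo 32 > /sys/block/sda/device/queue_depth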


Argument list too long

...in a Linux shell, this often happens when you used a * somewhere in your command.


The actual reason is a little lower level: shells expand shell globs before executing a command, so e.g. cp * /backup/ might expand to a very long list of files.

Either way, this may create a very large string to be handed to exec().


You get this error when that argument list is too long for the chunk of kernel memory reserved for passing such strings - which is hard-coded in the kernel (MAX_ARG_PAGES, usually something like 128KB).

You can argue it's a design flaw, or that it's a sensible guard against a self-DoS, but either way, that limit is in place.


There are various workable solutions:

  • if you meant 'everything in a directory', then you can often specify the directory and a flag to use recursion
  • if you're being selective, then find may be useful, and it allows doing things streaming-style, e.g.
find . -name '*.txt' -print0 | xargs -0 echo (See also find and xargs)
  • recompiling the kernel with a larger MAX_ARG_PAGES - of course, you don't know how much you'll need, and this memory is permanently inaccessible for anything else, so just throwing a huge number at it is not ideal
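If you're curious what the limit actually is on your system (a sketch; the number reported varies per kernel and configuration):

# the space the kernel allows for arguments plus environment, in bytes
getconf ARG_MAX
# GNU xargs can report the limits it will respect
xargs --show-limits < /dev/null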


Note

  • that most of these split the set of files into smaller sets, and execute something for each of those sets. In some cases this significantly alters what the overall command does - for example, if the command creates (as in replaces) an archive, most of the work would be overwritten and only the last set would end up in the resulting file.
You may want to think about it, and read up on xargs, and its --replace.
  • for filename in `ls`; do echo $filename; done is not a solution, nor is it at all safe against special characters.
ls | while read filename ; do echo $filename; done (specifically for bourne-type shells) works better, but I find it harder to remember why exactly, so I use find+xargs.
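To make the find+xargs route concrete, a minimal sketch for the cp * /backup/ case above (assuming /backup/ exists, and GNU cp and xargs for -t and -0):

# copy all regular files in the current directory without hitting the argument length limit;
# xargs splits the file list over as many cp invocations as needed
find . -maxdepth 1 -type f -print0 | xargs -0 cp -t /backup/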


Word too long

A csh error saying that a word in a command is over 1024 characters long (1024 being the default limit, as of this writing at least).


It is usually caused by a single long value.

And often specifically by a line like:

setenv PATH ${PATH}:otherstuff

...often specifically PATH or LD_LIBRARY_PATH, as those are the ones most easily already 1000-ish characters long.

You can check that with something like

echo $PATH | wc -c
echo $LD_LIBRARY_PATH | wc -c


In general, you have a few options:

  • switch to a shell that doesn't have this problem
  • recompile csh with a larger BUFSIZE
  • figure out the specific cause
    • typically: clean the path of long or unnecessary or duplicate entries
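For that last option, a sketch of stripping duplicate entries (bourne-style shell; translate to setenv yourself for csh):

# print PATH with duplicate entries removed, keeping the first occurrence of each
printf '%s' "$PATH" | awk -v RS=: -v ORS=: '!seen[$0]++' | sed 's/:$//'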

Warning: "MAX NR ZONES" is not defined

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

This is related to an (automatically generated) file called bounds.h in your kernel source, and probably means you need to do a make prepare in your /usr/src/linux.
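A sketch, assuming your kernel source lives in /usr/src/linux:

cd /usr/src/linux
# regenerates the auto-generated headers, bounds.h among them
make prepare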


Wrong ELF class: ELFCLASS32

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Usually means a 64-bit app is trying to load a 32-bit library.

In my case, a 64-bit binary trying to load a 32-bit .so file, largely because LD_LIBRARY_PATH included the 32-bit but not the 64-bit library directory.
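A sketch of checking what you're actually dealing with (the paths are just illustrations):

# report whether a binary or library is 32-bit or 64-bit ELF
file /usr/bin/someapp
readelf -h /usr/lib/libfoo.so | grep Class
# show which libraries the dynamic linker resolves, given the current LD_LIBRARY_PATH
ldd /usr/bin/someapp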

If this happens in a compiled application, it may mean you need to recompile it from scratch before it picks up the right one. (verify)

Could be triggered by LD_PRELOAD tricks.