Under heavy IO load on servers you may see something like:

 INFO: task nfsd:2252 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

...probably followed by a call trace that mentions your filesystem, and probably io_schedule and sync_buffer.


'''This message is not an error''', it's telling you that this process has not been scheduled on the CPU ''at all'' for 120 seconds, because it was in [[uninterruptable sleep]] state. {{comment|(The code behind this message sits in <tt>hung_task.c</tt> and was added somewhere around <tt>2.6.30</tt>. It is a kernel thread that detects tasks that stay in the [[D state]] for a while)}}
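For reference, the detector's threshold is tunable at runtime. A minimal shell sketch, assuming a kernel built with <tt>CONFIG_DETECT_HUNG_TASK</tt> (the file is simply absent otherwise):

```shell
# Inspect the hung-task detector's threshold.
f=/proc/sys/kernel/hung_task_timeout_secs
if [ -f "$f" ]; then
    cat "$f"    # the threshold in seconds; 120 is the usual default
else
    echo "hung-task detector not available on this kernel"
fi

# To raise the threshold rather than disable the check (needs root):
#   sysctl -w kernel.hung_task_timeout_secs=240
# Writing 0 disables it entirely, as the message itself notes.
```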


'''At the same time''', 120 real-world seconds is an ''eternity'' for the CPU, and most programs, and most users.


Not being scheduled for that long typically signals resource starvation, usually IO, often some disk API.

Which means you usually don't want to silence or ignore that message, because you want to find out when and why this happened, and probably avoid it in the future.



The stack trace can help diagnose what it was doing. {{comment|(which is not so informative of the ''reason'' - the named program is often the victim of another one misbehaving, though it is sometimes the culprit)}}
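If you catch the situation while it is happening, you can also look for tasks currently in that state yourself. A small sketch using procps <tt>ps</tt> (the <tt>wchan</tt> column hints at what each task is waiting on in the kernel):

```shell
# Show tasks in uninterruptible sleep (state 'D') right now; these are the
# candidates for this message if they stay that way for 120+ seconds.
# Keeps the header line (NR==1) and rows whose STAT column starts with D.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```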


Reasons include
* the system is heavily [[swapping]], possibly to the point of [[thrashing]], due to memory allocation issues
: could be any program

* the underlying IO system is very slow for some reason
:: I've seen mentions of this happening in VMs that share disks

* specific bugs (in kernel code, systemd) have caused this as a side effect
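A quick way to see which of these is in play is to watch <tt>vmstat</tt> for a few seconds; a sketch (assumes procps's <tt>vmstat</tt> is installed):

```shell
# Five one-second snapshots.
#  b     = processes blocked in uninterruptible sleep
#  si/so = swap-in / swap-out traffic (sustained nonzero means real swapping)
#  wa    = percentage of CPU time spent waiting on IO
vmstat 1 5
```

Sustained nonzero si/so points at memory pressure; a persistently high b or wa points at a saturated IO path.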


<!--
{{comment|(...though you can explicitly set <tt>sysctl_hung_task_panic</tt>, in which case your host is now panicked)}}
-->


Notes:
* if it happens constantly, your IO system is slower than your IO use


* can happen '''to''' a process that was [[ionice]]d into the idle class
: which means ionice is working as intended, because the idle class is meant as an extreme politeness thing. It just indicates something else is doing a consistent bunch of IO right now (for at least 120 seconds), and doesn't help find the actual cause
: e.g. [http://en.wikipedia.org/wiki/Locate_%28Unix%29 updatedb], which, when ioniced, may be the process this message names


* if it happens only nightly, look at your cron jobs

* a [[thrashing]] system can cause this, and then it's purely a side effect of a program using more memory than there is RAM

* being blocked by a desktop-class drive with bad sectors (because such drives retry for a long while)


* NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem. {{verify}}

* if it happens on a fileserver, you may want to consider spreading the load over more fileservers, or using a parallel filesystem


* tweaking the linux io scheduler for the device may help (see [[Computer_data_storage_-_General_%26_RAID_performance_tweaking#OS_scheduling]])
: if your load is fairly sequential, you may get some relief from using the <tt>noop</tt> io scheduler (instead of <tt>cfq</tt>), though note that that disables [[ionice]]
: if your load is relatively random, upping the queue depth may help
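Both knobs live in sysfs. A hedged sketch, where <tt>sda</tt> is a placeholder device name, and where the available scheduler names depend on your kernel (newer blk-mq kernels offer <tt>none</tt>/<tt>mq-deadline</tt>/<tt>bfq</tt>/<tt>kyber</tt> instead of <tt>noop</tt>/<tt>cfq</tt>):

```shell
dev=sda                       # placeholder; substitute your actual device
q=/sys/block/$dev/queue
if [ -d "$q" ]; then
    cat "$q/scheduler"        # the active scheduler is shown in [brackets]
    cat "$q/nr_requests"      # block-layer queue depth
fi

# As root, switching scheduler / deepening the queue would look like:
#   echo noop > /sys/block/$dev/queue/scheduler
#   echo 512  > /sys/block/$dev/queue/nr_requests
```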



[[Category:Unices]]
[[Category:Warnings and errors]]