INFO: task blocked for more than 120 seconds.

#redirect [[Some_explanation_to_some_errors_and_warnings#INFO:_task_blocked_for_more_than_120_seconds.]]
Under heavy IO load on servers you may see something like:
 INFO: task nfsd:2252 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...probably followed by a call trace that mentions your filesystem, and probably io_schedule and sync_buffer.
 
 
'''This message is not an error'''. It tells you that this process has not been scheduled on the CPU ''at all'' for 120 seconds, because it was in [[uninterruptable sleep|uninterruptible sleep]] state. {{comment|(The code behind this message sits in <tt>hung_task.c</tt> and was added somewhere around <tt>2.6.30</tt>. It is a kernel thread that detects tasks that stay in the [[D state]] for a while)}}
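
To check how this watchdog is currently configured, you can read the same file the message mentions. A minimal sketch, assuming a kernel built with hung-task detection (otherwise the file does not exist); the function name and its optional path argument are only there for illustration:

```shell
# Print the hung-task timeout in seconds, or "unavailable" if the file
# cannot be read (e.g. a kernel without hung-task detection).
# The optional argument lets you point the function at another file.
hung_task_timeout() {
    f="${1:-/proc/sys/kernel/hung_task_timeout_secs}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unavailable"
    fi
}

hung_task_timeout
```

A value of 0 means the check is disabled, which is what the hint in the message itself does.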
 
 
'''At the same time''', 120 real-world seconds is an ''eternity'' for the CPU, and most programs, and most users.
 
Not being scheduled for that long typically signals resource starvation, usually of IO, often in some disk-related call.
This means you usually don't want to silence or ignore the message,
because you want to find out when and why this happened, and probably avoid it in the future.
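
One way to find out ''when'' it happened is to look for the message in the kernel log. A sketch, assuming util-linux <tt>dmesg</tt> (reading the ring buffer may require root on some systems); <tt>hung_task_reports</tt> is an illustrative name, not an existing tool:

```shell
# Keep only hung-task reports from a kernel log read on stdin.
hung_task_reports() {
    grep 'blocked for more than'
}

# -T asks dmesg for human-readable timestamps.
dmesg -T 2>/dev/null | hung_task_reports \
    || echo "no hung-task reports found in the kernel ring buffer"
```

The ring buffer only holds recent messages, so for older occurrences you would feed your persistent kernel log into the same filter instead.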
 
 
The stack trace can help diagnose what the task was doing. {{comment|(This says little about the ''reason'', though:
the named program is often the victim of another one misbehaving, though it is sometimes the culprit)}}
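
While the problem is ongoing, it can help to see which tasks are in the D state right now and what they are waiting in. A rough sketch, assuming the procps version of <tt>ps</tt> (the <tt>wchan:32</tt> width syntax is procps-specific); <tt>filter_d_state</tt> is an illustrative helper name:

```shell
# Keep the header line plus every row whose STAT column starts with D
# (uninterruptible sleep). Expects ps-style output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# wchan is the kernel function the task is sleeping in;
# :32 widens the column so the name is less likely to be cut off.
ps -eo pid,stat,wchan:32,comm | filter_d_state
```

Seeing a few D-state tasks for a moment is normal. It is the ones that stay there across repeated runs that are interesting.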
 
 
Reasons include
* the system is heavily [[swapping]], possibly to the point of [[trashing|thrashing]], because programs want more memory than there is RAM
: the memory hog could be any program

* the underlying IO system is very slow for some reason
: I've seen mentions of this happening in VMs that share disks
 
* specific bugs (in kernel code, systemd) have caused this as a side effect
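
To tell the swapping case apart from plain slow IO, you can check whether pages are actively moving to and from swap. A minimal sketch, assuming a Linux <tt>/proc/vmstat</tt> with its <tt>pswpin</tt> and <tt>pswpout</tt> counters; the function names and the default interval are illustrative choices:

```shell
# Print the value of one counter from a vmstat-style file
# ("name value" per line). Second argument overrides the file.
vmstat_counter() {
    awk -v key="$1" '$1 == key { print $2 }' "${2:-/proc/vmstat}" 2>/dev/null
}

# Report how many pages were swapped in and out during an interval
# (in seconds, default 5). Missing counters are treated as 0.
swap_activity() {
    interval="${1:-5}"
    in1=$(vmstat_counter pswpin);  out1=$(vmstat_counter pswpout)
    sleep "$interval"
    in2=$(vmstat_counter pswpin);  out2=$(vmstat_counter pswpout)
    echo "pages swapped in:  $(( ${in2:-0} - ${in1:-0} ))"
    echo "pages swapped out: $(( ${out2:-0} - ${out1:-0} ))"
}

swap_activity 1
```

Numbers that stay at zero suggest swap is not involved. Consistently large numbers in both directions are what thrashing tends to look like.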
 
 
 
 
 
 
<!--
{{comment|(...though you can explicitly set <tt>sysctl_hung_task_panic</tt>, in which case your host is now panicked)}}
-->
 
 
 
 
Notes:
* if it happens constantly, your IO system cannot keep up with the IO load you put on it
 
* it can happen '''to''' a process that was [[ionice]]d into the idle class,
: which means ionice is working as intended, because the idle class is meant to be extremely polite. The message then only indicates that something else has been doing IO continuously (for at least 120 seconds), and doesn't help find the actual cause
: e.g. [http://en.wikipedia.org/wiki/Locate_%28Unix%29 updatedb] may be the one reported if it was ioniced
 
* if it happens only nightly, look at your cron jobs
 
* a [[trashing|thrashing]] system can cause this, and then it's purely a side effect of programs using more memory than there is RAM
 
* a desktop-class drive with bad sectors can block tasks (because such drives retry for a long while)
 
 
* NFS seems to be a common culprit, probably because it's good at filling the writeback cache. That implies blocking while writeback happens, which is likely to block various things related to the same filesystem. {{verify}}
 
* if it happens on a fileserver, you may want to consider spreading to more fileservers, or using a parallel filesystem
 
 
* tweaking the Linux IO scheduler for the device may help (see [[Computer_data_storage_-_General_%26_RAID_performance_tweaking#OS_scheduling]])
: if your load is fairly sequential, you may get some relief from using the <tt>noop</tt> IO scheduler (instead of <tt>cfq</tt>), though note that this disables [[ionice]]
: if your load is relatively random, upping the queue depth may help
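
Before changing anything, it is worth looking at what each device currently uses. A sketch, assuming the usual sysfs layout under <tt>/sys/block</tt>, where the <tt>scheduler</tt> file lists the available schedulers with the active one in brackets. It treats <tt>nr_requests</tt> as the queue depth in question, which is an assumption; <tt>active_scheduler</tt> is an illustrative name, and <tt>sda</tt> in the comment is a hypothetical device:

```shell
# Print the active scheduler: the name shown in [brackets]
# in a sysfs scheduler file, e.g. "noop deadline [cfq]" gives "cfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Show the active scheduler and request queue size per block device.
for q in /sys/block/*/queue; do
    [ -r "$q/scheduler" ] || continue
    dev=${q%/queue}
    dev=${dev##*/}
    echo "$dev: scheduler=$(active_scheduler "$q/scheduler") nr_requests=$(cat "$q/nr_requests" 2>/dev/null)"
done

# To switch (as root), write one of the listed names to the same file,
# for example on a hypothetical device sda:
#   echo noop > /sys/block/sda/queue/scheduler
```

Which schedulers are offered depends on the kernel version, so check the list in the <tt>scheduler</tt> file rather than assuming <tt>noop</tt> and <tt>cfq</tt> are present. Changes made this way do not survive a reboot.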
 
 
 
[[Category:Unices]]
[[Category:Warnings and errors]]

Latest revision as of 15:10, 14 July 2023