INFO: task blocked for more than 120 seconds.

From Helpful
Revision as of 15:44, 12 September 2012 by Helpful (Talk | contribs)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Under heavy IO load on servers you may see something like:

INFO: task nfsd:2252 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

...probably followed by a call trace that mentions your filesystem, and probably io_schedule and sync_buffer.

Don't worry too much about how serious such a trace looks: the message is purely informational (unless you set the kernel.hung_task_panic sysctl, in which case your host has now panicked). It is still probably something you want to do something about.
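
To see how the detector is configured on a given host, something like the following sketch may help. It only reads the two sysctl files under /proc/sys/kernel. The function name and its optional directory argument are made up for illustration (the argument exists only so the function can be pointed at a test directory).

```shell
#!/bin/sh
# Print the hung-task detector settings, if this kernel exposes them.
# show_hung_task_settings is a hypothetical helper name; the optional
# argument defaults to the real /proc/sys/kernel directory.
show_hung_task_settings() {
    dir="${1:-/proc/sys/kernel}"
    for name in hung_task_timeout_secs hung_task_panic; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            # Kernel built without the detector, or no /proc mounted.
            printf '%s = (not available)\n' "$name"
        fi
    done
}

show_hung_task_settings

# To change the threshold rather than disable the check (as root;
# 300 is an arbitrary example value, not a recommendation):
#   echo 300 > /proc/sys/kernel/hung_task_timeout_secs
```

A hung_task_panic value of 1 means the host will panic instead of just logging, so it is worth checking before assuming the message is harmless.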


The code for this sits in hung_task.c. It is relatively new (added somewhere around 2.6.30?). It runs as a kernel thread (khungtaskd) that detects tasks that stay in the D state (uninterruptible sleep) for a while, which usually means they are waiting for IO. It complains when it sees that a task has been stuck like that for so long that it has not been scheduled for 120 seconds (the default).
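
To check which tasks are in the D state right now, a rough sketch is to filter ps output on the state column. This assumes a ps that accepts -e and -o with the pid, stat and comm fields (procps does). The helper name filter_d_state is made up for illustration.

```shell
#!/bin/sh
# Keep the header line plus any row whose second column (STAT)
# starts with D, i.e. uninterruptible sleep.
# filter_d_state is a hypothetical helper name.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Snapshot of tasks currently in the D state. A task showing up once
# is normal; one that stays listed across repeated runs is a candidate
# for the "blocked for more than 120 seconds" message.
ps -eo pid,stat,comm | filter_d_state
```

Running this a few times, some seconds apart, gives an idea of whether the same tasks stay blocked or whether different tasks briefly pass through the state.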


Notes:

  • most likely to happen for a process that was ioniced into the idle class, in which case this message indicates intended, or at least expectable, behaviour for that process under constant IO load
  • if not, this can easily mean your IO system is slower than your IO use -- often specifically caused by overhead, such as that from head seeking
  • tweaking the linux io scheduler for the device may help (See Computer hard drives#Drive_specifics)
    • if your load is fairly sequential, you may get some relief from using the noop io scheduler (instead of cfq)
    • if it's relatively random, upping the queue depth may help
  • if it happens nightly, it's probably some cron job, with the load coming from something like updatedb.
  • if it happens on a fileserver, you may want to consider spreading to more fileservers, or using a parallel filesystem
  • NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem. (verify)
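
For the scheduler and queue depth notes above, a sketch like the following shows the current settings per block device via sysfs. It assumes "queue depth" refers to the block layer's queue/nr_requests (SCSI-style devices also have a separate device/queue_depth). The helper name active_scheduler and the device name sda in the comments are only examples.

```shell
#!/bin/sh
# Print the active scheduler, which sysfs marks with square brackets,
# e.g. "noop deadline [cfq]" -> "cfq". Prints nothing if no name is
# bracketed. active_scheduler is a hypothetical helper name.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# List scheduler and request-queue size for every block device.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue        # no such files on this system
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: scheduler=%s nr_requests=%s\n' "$dev" \
        "$(active_scheduler "$f")" \
        "$(cat "${f%/scheduler}/nr_requests" 2>/dev/null)"
done

# Changing them (as root; sda and 256 are example values only,
# and neither change persists across reboots):
#   echo noop > /sys/block/sda/queue/scheduler
#   echo 256  > /sys/block/sda/queue/nr_requests
```

Which schedulers are offered depends on the kernel, so it is safer to pick from the names the scheduler file actually lists than to assume noop or cfq is present.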