Difference between revisions of "INFO: task blocked for more than 120 seconds."

From Helpful
Jump to: navigation, search
m
m
Line 23: Line 23:
 
* if it happens constantly your IO system is slower than your IO use
 
* if it happens constantly your IO system is slower than your IO use
  
* NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem. {{verify}}
+
* most likely to happen ''to'' a process that was [[ionice]]d into the idle class. Which means it's working, idle-class is meant as an extreme politeness thing. It just indicates something else is doing a bunch of IO right now (for at least 120 seconds)
 +
: e.g. [http://en.wikipedia.org/wiki/Locate_%28Unix%29 updatedb] (may be victim if it were ioniced, cause if not)
  
* if it happens on a fileserver, you may want to consider spreading to more fileservers, or using a parallel filesystem
+
* if it happens only nightly, look at your cron jobs
  
 +
* a [[trashing]] system can cause this, and then it's purely a side effect of one program using too much RAM
  
 +
* being blocked by a desktop-class drive with bad sectors (because they retry for a long while)
  
* if it happens only nightly, it's probably some cron job such as [http://en.wikipedia.org/wiki/Locate_%28Unix%29 updatedb]
 
  
* most likely to happen ''to'' a process that was [[ionice]]d into the idle class, in which case this this message just indicates that that is working (and another process is doing IO fairly continuously for at least 120 seconds)
+
* NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem. {{verify}}
  
 
+
* if it happens on a fileserver, you may want to consider spreading to more fileservers, or using a parallel filesystem
* a [[trashing]] system can easily cause this {{verify}}, but then it's sort of a secondar issue
+
  
  
 
* tweaking the linux io scheduler for the device  may help  (See [[Computer_data_storage_-_General_%26_RAID_performance_tweaking#OS_scheduling]])
 
* tweaking the linux io scheduler for the device  may help  (See [[Computer_data_storage_-_General_%26_RAID_performance_tweaking#OS_scheduling]])
** if your load is fairly sequential, you may get some relief from using the <tt>noop</tt> io scheduler (instead of <tt>cfq</tt> - though note that that disables [[ionice]])
+
: if your load is fairly sequential, you may get some relief from using the <tt>noop</tt> io scheduler (instead of <tt>cfq</tt>) though note that that disables [[ionice]])
** if it's relatively random upping the queue depth may help
+
: if your load is relatively random, upping the queue depth may help
 
+
 
+
* I've seen this mention kjournald, when the underlying RAID array was itself blocked by a desktop drive with bad sectors.
+
  
  

Revision as of 13:05, 2 August 2016

Under heavy IO load on servers you may see something like:

INFO: task nfsd:2252 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

...probably followed by a call trace that mentions your filesystem, and probably io_schedule and sync_buffer.


The code behind this sits in hung_task.c and was added somewhere around 2.6.30. This is a kernel thread that detects tasks that stays in the D state for a while (which typically meaning it is waiting for IO).

It complains when it sees a process has been waiting on IO so long that the whole process has not been scheduled for any CPU-time for 120 seconds (default).

This message is not an error, it is meant as an indication of where the program was when it had to wait. (which is only half informative - it's not unusual that the real IO load comes from another process)



Notes:

  • if it happens constantly your IO system is slower than your IO use
  • most likely to happen to a process that was ioniced into the idle class. Which means it's working, idle-class is meant as an extreme politeness thing. It just indicates something else is doing a bunch of IO right now (for at least 120 seconds)
e.g. updatedb (may be victim if it were ioniced, cause if not)
  • if it happens only nightly, look at your cron jobs
  • a trashing system can cause this, and then it's purely a side effect of one program using too much RAM
  • being blocked by a desktop-class drive with bad sectors (because they retry for a long while)


  • NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem. (verify)
  • if it happens on a fileserver, you may want to consider spreading to more fileservers, or using a parallel filesystem


if your load is fairly sequential, you may get some relief from using the noop io scheduler (instead of cfq) though note that that disables ionice)
if your load is relatively random, upping the queue depth may help