INFO: task blocked for more than 120 seconds.


Under heavy IO load on servers you may see something like:

 INFO: task nfsd:2252 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

...probably followed by a call trace that mentions your filesystem, and likely io_schedule and sync_buffer.
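If you want to see which tasks are in that state right now, a minimal sketch (assuming a procps-style ps, and root plus SysRq enabled for the second command):

 # list tasks currently in the D (uninterruptible sleep) state, plus what they are waiting in
 ps axo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'

 # dump kernel stack traces of all blocked tasks to the kernel log, on demand
 echo w > /proc/sysrq-trigger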


The code behind this sits in hung_task.c and was added somewhere around 2.6.30. It is a kernel thread that detects tasks that stay in the D state for a while, which typically means they are waiting for IO.

It complains when it sees that a process has been waiting on IO for so long that it has not been scheduled for 120 seconds (the default). This is not an error or a crash; it is only meant as an indication of where the program was when it had to wait.

(...though you can explicitly set sysctl_hung_task_panic, in which case your host is now panicked)
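For reference, the knobs involved live under /proc/sys/kernel/. A minimal sketch of looking at and adjusting them (assuming the sysctl utility is present; put changes in /etc/sysctl.conf or sysctl.d if you want them to survive a reboot):

 # current settings
 sysctl kernel.hung_task_timeout_secs kernel.hung_task_panic

 # raise the warning threshold to 300 seconds (until reboot)
 sysctl -w kernel.hung_task_timeout_secs=300

 # or silence the check entirely, as the message itself suggests
 echo 0 > /proc/sys/kernel/hung_task_timeout_secs

 # kernel.hung_task_panic=1 gives the panic-on-hang behaviour mentioned above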



Notes:

  • if it happens constantly, your IO system is slower than your IO use - often specifically caused by overhead, such as that from head seeking
  • tweaking the linux io scheduler for the device may help (see the OS scheduling notes in Computer data storage - General & RAID performance tweaking)
      • if the IO is relatively random, upping the queue depth may help
  • NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem. (verify)
  • if it happens on a fileserver, you may want to consider spreading to more fileservers, or using a parallel filesystem
  • if it happens only nightly, it's probably some cron job such as updatedb
  • most likely to happen to a process that was ioniced into the idle class, in which case this message just indicates that that is working (and another process is doing IO fairly continuously for at least 120 seconds) - see the ionice sketch after this list
  • a thrashing system can easily cause this (verify), but then it's sort of a secondary issue
  • I've seen this mention kjournald, when the underlying RAID array was itself blocked by a desktop drive with bad sectors.
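To make the ionice case above concrete, a minimal sketch (assuming util-linux ionice and an IO scheduler such as CFQ/BFQ that actually honours IO classes; the pid and the updatedb example are just placeholders):

 # move an already-running job (pid 1234 is a placeholder) into the idle IO class,
 # so it only gets disk time when nothing else wants it - exactly the situation
 # in which this message can legitimately show up for it
 ionice -c 3 -p 1234

 # or start a job in the idle class from the beginning
 ionice -c 3 updatedb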