Oom kill

From Helpful
Revision as of 17:11, 26 March 2012 by Helpful


oom_kill refers to the code in the Linux kernel that starts killing processes when the system cannot find free pages fast enough to satisfy an allocation.

The easiest way to cause it is to allocate so much that the system runs completely out of memory, but there are other ways of triggering it (some of which are unusual).

(Some kernels and kernel configurations let you enable or disable this functionality. Others seem to just have it, period. I'm guessing it is configurable in all newer kernels.)
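As a rough way of checking how a given machine is configured, the sketch below reads two standard Linux sysctls that, as far as I know, most directly affect this behaviour: vm.overcommit_memory (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting) and vm.panic_on_oom (0 = kill a process, nonzero = panic instead). The function name and the vm_dir parameter are made up for this example, and other settings may matter on your kernel.

```python
from pathlib import Path

# Both are standard Linux sysctls exposed as files under /proc/sys/vm.
SETTINGS = ("overcommit_memory", "panic_on_oom")


def read_oom_settings(vm_dir="/proc/sys/vm"):
    """Return {setting: int or None}; None means missing or unreadable.

    vm_dir is a parameter only so the function can be pointed at a
    test directory; it is not part of any kernel interface.
    """
    result = {}
    for name in SETTINGS:
        try:
            result[name] = int((Path(vm_dir) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_oom_settings().items():
        print(f"vm.{name} = {value}")
```

On a system without /proc/sys/vm (or without one of the files) the corresponding value is simply reported as None.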

Killing sounds bad, but consider that an OS can deal with completely running out of memory in roughly two ways:

  • denying all new memory allocations until the scarcity stops. This isn't very useful, because:
    • if the cause is a flaky program (and it usually is), then the scarcity isn't going to stop
    • most programs do not actually check every memory allocation for success, which means programs will be crashing anyway. Even if they do check, there is often no useful reaction other than stopping the process. So in the best case, random applications will stop. In the worst case, your system will.
  • killing the memory-hungriest application to end the memory scarcity.
    • if the problem is indeed a single process that has gone haywire, the best option is to find and kill that process.
    • the assumption is that the system has had enough memory for normal operation up to now. Given that assumption and the current trouble, the likely cause is a single process that is misbehaving or just misconfigured (e.g. it pre-allocates more memory than you have).
    • this could misfire on badly configured systems (e.g. daemons configured to use all RAM, leaving nothing for incidental variation, with no swap to catch it either)

The process that gets killed is usually simply the one with the most allocated memory, though oom_kill is a little smarter than that.
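To get an idea of which processes the kernel currently considers the best candidates, you can look at the per-process score it exposes in /proc/<pid>/oom_score (higher means more likely to be picked). The following is a minimal sketch, not a definitive tool: rank_by_oom_score and its parameters are names invented for this example, and it assumes a Linux /proc that also provides /proc/<pid>/comm for the process name.

```python
from pathlib import Path


def rank_by_oom_score(proc_dir="/proc", limit=5):
    """Return up to `limit` (score, pid, name) tuples, highest score first."""
    try:
        entries = list(Path(proc_dir).iterdir())
    except OSError:
        return []  # no /proc here (e.g. not Linux)

    ranked = []
    for entry in entries:
        if not entry.name.isdigit():
            continue  # only numeric directories are processes
        try:
            score = int((entry / "oom_score").read_text().strip())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited or is unreadable; skip it
        ranked.append((score, int(entry.name), name))

    # Highest score first; ties are broken by lowest pid.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:limit]


if __name__ == "__main__":
    for score, pid, name in rank_by_oom_score():
        print(f"{score:6d}  {pid:7d}  {name}")
```

The scores are a snapshot: they change as processes allocate and free memory, so this only hints at what would be killed if the system ran out right now.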

Whether the memory use is caused by a bug or just by something wanting more memory than you have, most people would consider it more useful to kill such a process (even if that is, say, a database server) than to let the system die. By this time the system is probably swapping like crazy anyway, and with particularly badly behaved apps this can happen right after booting.

You may want some way of watching the system logs for messages saying that processes are being killed, although basic system statistics (memory use and swap activity) are also a good indication.
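As one way of doing that log watching, the sketch below scans kernel log lines for OOM kills. It assumes the kernel reports a kill with a line containing "Killed process <pid> (<name>)", which matches the messages I have seen, but the exact wording varies between kernel versions, so treat the pattern as an assumption to check against your own logs. The names find_oom_kills and KILL_RE are made up for this example.

```python
import re
import sys

# Assumed message shape: "... Killed process 1234 (mysqld) ..."
KILL_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def find_oom_kills(lines):
    """Return a list of (pid, name) for every kill message found in `lines`."""
    kills = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


if __name__ == "__main__":
    # Each argument is a log file to scan, e.g. a saved copy of dmesg output.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            for pid, name in find_oom_kills(log):
                print(f"{path}: oom-killed pid {pid} ({name})")
```

Run it against whatever file your distribution writes kernel messages to, or against the saved output of dmesg; a periodic run from cron that mails any output would be a crude but workable alert.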

See also