Virtual memory





===Intro===
'Virtual memory' describes an abstraction that we ended up using for a number of different things.
<!--
{{comment|(Note: this is a broad-strokes introduction that simplifies and ignores a lot of the historical evolution of how we got where we are and ''why'' - a bunch of which I know I don't know)}}.

For the ''most'' part, you can explain those reasons separately,
though they got entangled over time (in ways that ''mostly'' operating system programmers need to worry about).



At a low level, memory access is "set an address, do a request, get back the result".

In olden times,
this described hardware that did nothing more than that {{comment|(in some cases you even needed to do that yourself: set a value on the address pins, flip the pin that meant a request, and read out the data on some other pins)}},
and the point is that there was ''nothing'' keeping you from doing any request you wanted.


Because everyone used the same memory space, memory management was a cooperative thing, where everything needed to play nice.
That was hard, and beyond conventions about which parts were the operating system's and you wouldn't touch,
there were no standards for multiple processes running concurrently, unless they actively knew about each other.


Which was fine, because multitasking wasn't a buzzword yet.
We ran one thing at a time, and the exceptions to that were clever about how they did it.






To skip a ''lot'' of history {{comment|(the variants along the way are a mess to actually get into)}},
what we have now is a '''virtual memory system''',
where
* each task gets its own address space
* there is something managing the assignment of parts of memory to tasks
* our running code ''never'' deals ''directly'' with physical addresses
* and when a request is made, ''something'' does the translation between the addresses the program sees, and the physical addresses and memory they actually go to.


It just means there's something in between - mostly there to be a little cleverer for you.
{{comment|(The low level implementation is also interesting, in that there is hardware assisting this setup - things would be terribly slow if it weren't.  At the same time, these details are also largely irrelevant, in that it's always there, and fully transparent even to programmers.)}}
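
To make the "something does translation" part a little more concrete, here is a toy sketch of page-based translation. It is only an illustration, not how any real MMU or kernel works; the page size and the tiny page table are made up for the example.

 #include <stdint.h>
 #include <stdio.h>
 
 /* Toy example: translate a virtual address to a physical one
    using a made-up page table. Real MMUs do this in hardware,
    with multi-level tables and TLB caching. */
 
 #define PAGE_SIZE 4096   /* 4 KiB pages, so 12 offset bits */
 
 /* pretend page table: virtual page number to physical frame number */
 static const uint64_t page_table[4] = { 7, 2, 9, 13 };
 
 uint64_t translate(uint64_t vaddr) {
     uint64_t vpn    = vaddr / PAGE_SIZE;   /* virtual page number */
     uint64_t offset = vaddr % PAGE_SIZE;   /* offset stays the same */
     return page_table[vpn] * PAGE_SIZE + offset;
 }
 
 int main(void) {
     uint64_t v = 0x1123;   /* page 1, offset 0x123 */
     printf("virtual 0x%llx maps to physical 0x%llx\n",
            (unsigned long long)v, (unsigned long long)translate(v));
     return 0;
 }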




There are a handful of reasons this addresses-per-task idea is useful.


One of them is just convenience.
If the OS tells you where to go,
you avoid overwriting other tasks accidentally.




No matter the addresses used within each task, they can't clash in physical memory.


Arguably the more important one is '''protected memory''':
that lookup can easily say "that was never allocated to you, ''denied''",
meaning a task can never accidentally ''or'' intentionally access memory it doesn't own.
{{comment|(There is no overlap in ownership until this is intentional: you specifically ask for it, and the OS specifically allows it - a.k.a. [[shared memory]].)}}


This is useful for stability,
in that a user task can't bring down a system task accidentally,
as was easy in the "everyone can trample over everyone" days.
Misbehaving tasks will ''probably'' fail in isolation.


It's also great for security,
in that tasks can't ''intentionally'' access what any other task is doing.


{{comment|(Note that you can have protection ''without'' virtual addresses, if you keep track of what belongs to a task ''without'' also adding a translation step.
You can also have virtual addresses without protection.  A few embedded systems opt for this because it can be a little simpler (and a little faster) without that extra step of work or indirection. Yet on general purpose computers you want and get both.)}}
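
A minimal illustration of that denial, assuming a typical linux-style system: writing through a pointer to an address the process was never given is refused by exactly this mechanism, and the process gets a segmentation fault instead of corrupting anyone else's memory.

 #include <stdio.h>
 
 int main(void) {
     int *p = (int *)0x1;   /* an address almost certainly not mapped into this process */
     printf("about to access memory we do not own...\n");
     *p = 42;               /* the MMU/kernel deny this: the process receives SIGSEGV */
     printf("never reached\n");
     return 0;
 }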




Another reason is that processes (and most programmers) no longer have to think about other tasks, the OS, or their management.


There are other details, like
* having this be a separate system means the OS can e.g. change how its memory system works, without programs ever having to change in any way.
: this matters for changes in how hardware implements it (with extensions and variations even in the same CPU architecture/family), but also lets us work very similarly on dissimilar hardware


* it can make fragments of RAM look contiguous to a process, which makes life much easier for programmers, and has only a small effect on speed (because of the RA in RAM).
: generally the VMM does try to minimise fragmentation where possible, because too much can trash the fixed-size TLB


-->


===Swapping / paging; trashing===
<!--
'''Swapping''' is often understood as "things that could be in RAM are now on disk",
'''swapping in''' and '''swapping out''' as moving it to where it needs to go according to plan.

{{comment|(Side note: What windows usually calls paging, unixen usually call swapping.  In broad descriptions you can treat them the same. Once you get into the details the workings and terms do vary, and precise use becomes more important.)}}


Yet there are a few other meanings, and the distinctions can be important.


There are a few reasons a system might swap:

* the kernel decided some part of RAM was never accessed, and decides to reserve swap space rather than RAM for it
:: this amounts to just bookkeeping, rather than moving things around

* the kernel decided some part of RAM is rarely accessed, and decides to move it to disk

* the host is out of free RAM, and we're looking for the least-used RAM to move to disk

* a cgroup is out of RAM; similar idea

* swappiness says it's getting close

* overcommit_ratio
-->
====Overcommitting RAM with disk====
<!--
As mentioned, swapping/paging has the effect that
the VMM can have a pool of virtual memory that is backed by RAM ''and'' by disk.

''"Can you choose to map or allocate more total memory than would all fit into RAM at the same time?"''
Yes.
And a small degree of this is even ''common''.

Using disk for memory seems like a bad idea, because disks are significantly slower than RAM in both bandwidth and latency.
''Especially'' in the platter days, but it is still true in the SSD days.
Which is why the VMM will always prefer to use RAM when it has it.

This "...and also disk" can be considered overcommit of RAM,
though note this is ''not'' the only meaning of the term overcommit (nor even the usual one) - see below.




There are a few reasons it can make sense:

* there is often some percentage of each program's memory that is inactive
: think "was copied in when starting, and then never accessed in days, and possibly never will be"
: if the VMM can adaptively move that to disk (where it will still be available if requested, just slower), that frees up more RAM for ''active'' programs (or caches) to use.
: not doing this means a percentage of RAM would always be entirely inactive
: doing this means slow access whenever you ''did'' need that memory after all
: this means a small amount of swap space is almost always beneficial
: it also doesn't make that much of a difference, because most things that are allocated have a purpose. '''However'''...

* there are programs that blanket-allocate RAM, and will never access part/most of it even once.
: as such, the VMM can choose to not back allocations with anything, until first use.
:: this mostly just saves some up-front work
: separately, there is a choice of how to count this not-yet-used memory
:: you could choose to not count that memory at all - but that's risky, and vague
:: usually it counts towards the ''swap/page'' area (often without ''any'' IO there)
: this means a ''bunch'' of swap space can be beneficial, even if just for bookkeeping without ever writing to it
:: just to not have to give out memory we never use
:: while still actually having backing if they ever do
: (a minimal demonstration of such lazy backing is sketched below)
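
To make the "allocated but never touched" case concrete, here is a minimal sketch (linux-assumed; the exact numbers you see depend on the overcommit settings discussed below). It allocates a large block, touches only a small part of it, and the resident set stays small because untouched pages were never given physical backing.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/resource.h>
 
 int main(void) {
     size_t size = (size_t)1 << 30;          /* ask for 1 GiB */
     char *block = malloc(size);             /* on linux this typically succeeds even though */
     if (!block) return 1;                   /* the pages are not backed by anything yet */
 
     memset(block, 1, 4096 * 16);            /* touch only the first 64 KiB */
 
     struct rusage ru;
     getrusage(RUSAGE_SELF, &ru);            /* ru_maxrss is in kilobytes on linux */
     printf("allocated %zu MiB, peak resident set about %ld KiB\n",
            size >> 20, ru.ru_maxrss);
     free(block);
     return 0;
 }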


And yes, in theory neither would be necessary if programs behaved with perfect knowledge of other programs, of the system, and of how their data gets used.

In practice this usually isn't feasible, so it makes sense to do this at OS level, with what is basically a best-guess implementation.

In most cases it has a mild net positive effect, largely because both of the above reasons mean there's a little more RAM for active use.


Yes, it is ''partly'' circular reasoning, in that programmers now get lazy doing such bulk allocations knowing this implementation, thereby ''requiring'' such an implementation.
Doing it this way has become the most feasible because we've gotten used to thinking this way about memory.


Note that neither reason should impact the memory that programs actively use.
Moving inactive memory to disk will also rarely slow ''them'' down.
Things that periodically-but-very-infrequently do a thing may need up to a few extra seconds.


There is some tweaking to such systems:
* you can usually ask for RAM that is never swapped/paged. This is important if you need to guarantee it's always accessed within a very low timespan (which can e.g. matter for real-time music production based on samples); see the sketch after this list
* you can often tweak how pre-emptive swapping is
: To avoid having to swap things to disk during the next memory allocation request, it's useful to do so pre-emptively, when the system isn't too busy.
: this is usually somewhat tweakable
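
For the first point, the usual interface on linux and other unixes is mlock()/mlockall(). A minimal sketch (note that locked memory counts against RLIMIT_MEMLOCK, so this may need raised limits or privileges):

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
 
 int main(void) {
     size_t size = 16 * 1024 * 1024;          /* e.g. a 16 MiB sample buffer */
     char *samples = malloc(size);
     if (!samples) return 1;
 
     /* Ask the kernel to keep these pages in RAM, never swapped/paged out.
        Fails with ENOMEM/EPERM if RLIMIT_MEMLOCK is too low. */
     if (mlock(samples, size) != 0) {
         perror("mlock");
         return 1;
     }
 
     memset(samples, 0, size);                /* now guaranteed resident */
     /* ... real-time work that must never wait on disk ... */
 
     munlock(samples, size);
     free(samples);
     return 0;
 }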




'''Using more RAM than you have'''

The above begs the question of what happens when you attempt to actively use more RAM than you have.

This is a problem with ''and'' without swapping, with and without overcommit.


Being out of memory is a pretty large issue. Even the simplest "use what you have, deny once that's gone" system would have to just deny allocations to programs.

Many programs don't check every allocation, and may crash if not actually given what they ask for.  But even if they handled a denied allocation perfectly elegantly, in many cases the perfect behaviour would still amount to stopping the program.

Either way, the computer is no longer able to do what you asked of it.

There is an argument that it is preferable to have it continue, however slowly,
in the hope this was some intermittent bad behaviour that will be solved soon.

When you overcommit RAM with disk, this happens somewhat automatically.
And it's slow as molasses, because some of the actively used memory is now going not via microsecond-at-worst RAM but millisecond-at-best disk.
While there are cases that are less bad, it's doing this ''continuously'' instead of sporadically.

This is called '''trashing'''. If your computer suddenly started to continuously rattle its disk while being verrry slow, this is what happened.

{{comment|(This is also the number one reason why adding RAM may help a lot for a given use -- or not at all, if this was not your problem.)}}
-->




===Overcommitting (or undercommitting) virtual memory, and other tricks===
<!--
Consider we have a VMM system with swapping, i.e.
* all of the actively used virtual memory pages are in RAM
* infrequently used virtual memory pages are on swap
* never-used pages are counted towards swap {{comment|(this does ''not'' affect the amount of allocation you can do in total)}}

Overcommit is a system where the last point can instead be:
* never-used pages are nowhere.


'''More technically'''

More technically, overcommit allows allocation of address space, without allocating memory to back it.

Windows makes you do both of those explicitly,
implying fairly straightforward bookkeeping,
and that you cannot do this type of overcommit.
{{comment|(note: now less true due to compressed memory{{verify}})}}


Linux implicitly allows that separation,
basically because the kernel backs allocations only on first use {{comment|(which is also why some programs will ensure they are backed by something by storing something to all memory they allocate)}}.

Which is separate from overcommit; if overcommit is disabled this merely saves some initialisation work.
But with overcommit (and similar tricks, like OSX's and Win10's compressed memory, or linux's [[zswap]]) your bookkeeping becomes more flexible.

Which includes the option to give out more than you have.




'''Why it can be useful'''

Basically, when there may be a good reason pages will ''never'' be used.

The difference is that without overcommit this still needs to all count towards something (swap, in practice), and that overcommit means the bookkeeping assumes you will always have a little such used-in-theory-but-never-in-practice use.

How wise that is depends on use.


There are two typical examples.


One is that a large process may fork().
In a simple implementation you would need twice the memory,
but in practice the two forks' pages are copy-on-write, meaning they will be shared until written to.
Meaning you still need to do bookkeeping in case that happens, but even if it's another worker it probably won't be twice.

In the specific case where it wasn't for another copy of that program, but to instead immediately exec() a small helper program, that means the pages will ''never'' be written.
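
A minimal sketch of that fork()-then-exec() case (unix-assumed; "true" is just an arbitrary small helper program here): the parent may have a lot mapped and touched, but the child writes to almost none of it before exec() replaces its address space, so almost nothing is ever actually copied.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 
 int main(void) {
     /* Pretend this is a big process: a large, already-touched buffer. */
     size_t size = (size_t)512 * 1024 * 1024;
     char *big = malloc(size);
     if (!big) return 1;
     memset(big, 1, size);
 
     pid_t pid = fork();          /* child shares all pages copy-on-write */
     if (pid == 0) {
         /* Child: immediately replace ourselves with a small helper.
            The 512 MiB is mapped here too, but never written, so never copied. */
         execlp("true", "true", (char *)NULL);
         _exit(127);              /* only reached if exec failed */
     }
     waitpid(pid, NULL, 0);
     free(big);
     return 0;
 }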


The other I've seen is mentioned in the kernel docs: scientific computing that has very large, very sparse arrays.
This is essentially said computing avoiding writing their own clever allocator, by relying on the linux VMM instead.






Most other examples arguably fall under "users/admins not thinking enough".
Consider the JVM, which has its own allocator, and which you give an initial and maximum memory figure at startup.
Since it allocates memory on demand (and also initialises it{{verify}}), the user may ''effectively'' overcommit by having the collective -Xmx be more than RAM.
That's not really on the system to solve, that's just bad setup.




'''Critical view'''

Arguably, having enough swap makes this type of overcommit largely unnecessary, and mainly just risky.

The risk isn't too large, because it's paired with heuristics that disallow silly allocations,
and the oom_killer that resolves most runaway processes fast enough.


It's like [https://en.wikipedia.org/wiki/Overselling#Airlines overselling aircraft seats], or [https://en.wikipedia.org/wiki/Fractional-reserve_banking fractional reserve banking].

It's a promise that is ''less'' of a promise. That's fine (roughly for the same reasons that systems that allow swapping are not continuously trashing), but once your end users count on this, the concept goes funny, and when everyone comes to claim what's theirs you are still screwed.


Note that windows avoids the fork() case by not having fork() at all (there's no such cheap process duplication, and in the end almost nobody cares).




Counterarguments to overcommit include that system stability should not be based on bets,
that it is (ab)use of an optimization that you should not be counting on,
that programs should not be so lazy,
that we are actively enabling them to be lazy and behave less predictably,
and that sysadmins now have to frequently figure out why that [[#oom_kill|oom_kill]] happened.


Yet it is harder to argue that overcommit makes things less stable.

Consider that without overcommit, memory denials are more common (and that typically means apps crashing).

With or without overcommit, we are currently already asking what the system's emergency response should be (and there is no obvious answer to "what do we sacrifice first") because improper app behaviour is ''already a given''.


Arguably oom_kill ''can'' be smarter, usually killing only an actually misbehaving program,
rather than a denial probably killing the next program (which is more random).

But you don't gain much reliability either way.

{{comment|(In practice oom_kill can take some tweaking, because it's still possible that e.g. a mass of smaller
programs lead to the "fix" of your big database getting killed)}}


'''So is it better to disable it?'''

No, it has its beneficial cases, even if they are not central.

Disabling also won't prevent swapping or trashing,
as the commit limit is typically still > RAM {{comment|(by design, and you want that. Different discussion though)}}.

But apps shouldn't count on overcommit as a feature, unless you ''really'' know what you're doing.

Note that if you want to keep things in RAM, you probably want to lower [[#swappiness|swappiness]] instead.




'''Should I tweak it?'''

Possibly.

Linux has three modes:
* overcommit_memory=2: No overcommit
: userspace commit limit is swap + a fraction of RAM
: if that's &lt;RAM, the rest is only usable by the kernel, usually mainly for caches (which can be a useful mechanism to dedicate some RAM to the [[page cache]])

* overcommit_memory=1: Overcommit without checks/limits.
: Appropriate for relatively few cases, e.g. the very-sparse array example.
: in general just more likely to swap and OOM.

* overcommit_memory=0: Overcommit with heuristic checks (default)
: refuses large overcommits, allows the sort that would probably reduce swap usage






These modes mainly control the maximum allocation limit for userspace programs.
This is still a fixed number, and still ''related'' to the amount of RAM, but the relation can be more interesting.


On windows it's plainly what you have:
 swap space + RAM

and on linux it's:
 swap space + (RAM * (overcommit_ratio/100) )
or, if you instead use overcommit_kbytes,
 swap space + overcommit_kbytes {{verify}}


Also note that 'commit_ratio' might have been a better name,
because it's entirely possible to have that come out as ''less'' than RAM - undercommit, if you will.

This undercommit is also a sort of feature, because while that keeps applications from using it,
it ''effectively'' means that memory is dedicated to (mainly) kernel cache and buffers.






Note that the commit limit is ''how much'' it can allocate, not where it allocates from (some people assume this based on how it's calculated).


Yet if the commit limit is less than total RAM, applications will never be able to use all RAM.
This may happen when you have a lot of RAM and/or very little swap.


Because when you use overcommit_ratio (default is 50), the value (and sensibility) of the commit limit essentially depends on the ''ratio'' between swap space and RAM.

Say,
: 2GB swap, 4GB RAM, overcommit_ratio=50: commit limit at (2+0.5*4) = 4GB.
: 2GB swap, 16GB RAM, overcommit_ratio=50: (2+0.5*16) = 10GB.
: 2GB swap, 256GB RAM, overcommit_ratio=50: (2+0.5*256) = 130GB.

: 30GB swap, 4GB RAM, overcommit_ratio=50: (30+0.5*4) = 32GB.
: 30GB swap, 16GB RAM, overcommit_ratio=50: (30+0.5*16) = 38GB.
: 30GB swap, 256GB RAM, overcommit_ratio=50: (30+0.5*256) = 158GB.
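
On a live linux system you can check what these settings currently work out to, since the kernel exposes the result in /proc/meminfo. A minimal sketch:

 #include <stdio.h>
 #include <string.h>
 
 /* Print the kernel's current commit limit and how much is committed,
    as exposed in /proc/meminfo (values are in kB). */
 int main(void) {
     FILE *f = fopen("/proc/meminfo", "r");
     if (!f) { perror("/proc/meminfo"); return 1; }
 
     char line[256];
     while (fgets(line, sizeof line, f)) {
         if (strncmp(line, "CommitLimit:", 12) == 0 ||
             strncmp(line, "Committed_AS:", 13) == 0)
             fputs(line, stdout);
     }
     fclose(f);
     return 0;
 }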




So
* (more so if you have a lot of RAM) you may consider setting overcommit_ratio higher than the default
: possibly close to 100% {{comment|(or use overcommit_kbytes instead, because that's how you may be calculating it anyway)}}
: and/or add more swap space.

* if you desire to leave some dedicated to caches (which is a good idea) you have to do some arithmetic.
: For example, with 4GB swap and 48GB RAM,
:: you need ((48-4)/48=) ~91% to cover RAM,
:: and ((48-4-2)/48=) ~87% to leave ~2GB for caches.

* this default is why people suggest your swap area should be roughly as large as your RAM (same order of magnitude, anyway)


'''Should I add more RAM instead?'''

Possibly. It depends on your typical and peak load.

More RAM improves performance noticeably only when it avoids swapping under typical load.

It helps little beyond that. It helps when it means files you read get cached (see [[page cache]]),
but beyond that it has ''no'' effect.


Other notes:
* windows is actually more aggressive about swapping things out - it seems to do so in favour of IO caches
* linux is more tweakable (see [[#swappiness|swappiness]]) and by default is less aggressive.


People report that
* overcommit makes sense if you have significant memory you reserve but ''never'' use
: it should probably be seen as a minor optimization, and not a feature you should (ab)use


Unsorted notes
* windows puts more importance on the swap file

* you don't really want to go without swap file/space on either windows or linux
: (more so if you turn overcommit off on linux)

* look again at that linux equation. That's ''not'' "swap plus more-than-100%-of-RAM"
: and note that if you have very little swap and/or tons of RAM (think >100GB), it can mean your commit limit is lower than RAM

* swap will not avoid oom_kill altogether - oom_kill is triggered on low speed of freeing pages {{verify}}




-->

<!--
See also:
* https://serverfault.com/questions/362589/effects-of-configuring-vm-overcommit-memory
* https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
* https://www.win.tue.nl/~aeb/linux/lk/lk-9.html
* http://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/
-->
 
====On memory scarcity====
 
<!--
On a RAM-only system you will find that you at some point cannot find free pages.


When you've added swap and similar features,
you may find your bookkeeping says it can be done,
but in practice it will happen very slowly.


Also, having disconnected programs from the backing store,
only the kernel can even guess at how bad that is.


The most obvious case is more pages being actively used than there is physical RAM (which can happen without overcommit, and more easily with), but there are others. Apparently things like hot database backups may create so many [[dirty pages]] so quickly that the kernel decides it can't free anywhere near fast enough.


In a few cases it's due to a sudden (reasonable) influx of dirty pages, but otherwise transient.

But in most cases scarcity is more permanent, means we've started swapping and probably [[trashing]], making everything slow.

Such scarcity ''usually'' comes from a single careless / runaway program,
sometimes just a badly configured one (e.g. you told more than one that they could take 80% of RAM), sometimes from a slew of (probably-related) programs.
-->


===Practical notes===


====Linux====
 
 
<!--
It seems that *nix swapping logic is smart enough to do basic RAID-like spreading among its swap devices, meaning that a swap partition on every disk that isn't actively used (e.g. by something important like a database) is probably useful.


Swap used for hibernation can only come from a swap partition, not a swap file {{comment|(largely because that depends too much on whether the underlying filesystem is mounted)}}.


Linux allows overcommit, but real world cases vary.
It depends on three things:
* swap space
* RAM size
* overcommit_ratio (defaults to 50%)

When swap space is, say, half of RAM size,
on servers/workstations with at least dozens of GBs of RAM,
this will easily mean overcommit_ratio should be 80-90 for userspace to be able to use most/all RAM.

If the commit limit is lower than RAM, the rest goes (mostly) to caches and buffers.
Which, note, is often useful, and sometimes it can even be preferable to effectively have a little dedicated cache.
-->

=====SLUB: Unable to allocate memory on node=====
<!--
SLUB is [[slab allocation]], i.e. about dynamic allocation of kernel memory.


This particular warning seems most related to a bug in memory accounting.

It seems more likely to happen around containers with cgroup kmem accounting
(not yet stable in 3.x, and apparently there are still footnotes in 4.x),
but happens outside as well?

There was a kernel memory leak.
-->
==="How large should my page/swap space be?"===


<!--
Depends on use.
 
There used to be advice like "Your swap file needs to be 1.5x RAM size", and tables to go along.
 
The tables' values varying wildly shows just how arbitrary this is.
That they are usually 20 years old, more so.

It depends significantly on the amount of RAM you have.
 
 
 
But also, it just depends on use.


Generally, the better answer is to consider:
: less than a GB in most cases, and a few GB in a few


* unused swap space doesn't really hurt (other than in disk space)




The numbers tend to be bigger - but ideally also more predictable.

Tweaking can make more sense because of it.
For example, when two major services try to each use 70% of RAM for caches,
they'll end up pushing each other to swap (and both incur disk latency),
and you're better off halving the size of each,
when that means never involving disk.
And ideally you don't swap at all - having to swap something in means whatever's asking is now ''waiting''.








"Your swap file needs to be 1.5x RAM size"
This is arbitrary.
As shown by that number changing over time.
I've even seen tables that also vary that factor significantly with RAM.




-->


=====Swappiness=====
{{stub}}


<!--
 
There is an aggressiveness with which an OS will swap out allocated-but-inactive pages to disk.
 
 
This is often controllable.
 
Linux calls this ''swappiness''.
 
Higher swappiness means the general tendency to swap out is higher.
 
This general swappiness is combined with other (often more volatile) information,
including the system's currently mapped ratio,
a measure of how much trouble the kernel has recently had freeing up memory,
and some per-process (per-page) statistics.
 
 
Swapping out is always done with cost/benefit considerations.
 
The cost is mainly the time spent,
the benefit is largely giving more RAM to programs and caches (then also doing some swapping now rather than later).
 
(note that linux swaps less aggressively than windows to start with - at least with default settings)
 
 
There are always pages that are inactive simply because programs very rarely use them ([[80/20]]-like access patterns).
 
But if you have plenty of free RAM it might not even swap ''those'' out, because benefit is estimated to be low.
: I had 48GB and 256GB workstations at work and people rarely got them to swap ''anything''.
 
 
 
It's a gliding scale. To illustrate this point, consider the difference between:
 
* using more RAM than we have - we will probably swap in response to every allocation
: or worse, in the case of trashing: we are swapping purely to avoid crashing the system
: Under high memory strain, cost of ''everything'' is high, because we're not swapping to free RAM for easier future use, we're swapping to not crash the system.
 
 
* Swapping at any other time is mainly about pro-actively freeing up RAM for near-future use.
: this is IO we would otherwise have to concentrate into the next large allocation request
:: arguing for ''higher'' swappiness, because it will effectively spread that work over time
 
 
These are entirely different cases.
* The former clobbers caches, the latter builds it up
 
* the former ''is'' memory strain, the latter ''may'' lessen it in the future
: (if the peak use is still sensible, and won't trash itself along with everything else)
 
 
 
Arguments for '''lower''' swappiness:
* Delays putting things on slower disk until RAM is necessary for something else
** ...avoiding IO (also lets drives spin down, which can matter to laptop users)
** (on the flipside, when you want to allocate memory ''and'' the system needs to swap out things first to provide that memory, it means more work, IO, and sluggishness concentrated at that time)
 
* apps are more likely to stay in memory (particularly larger ones). Over-aggressive swapout (e.g. inactivity because you went for coffee) is less likely, meaning it is slightly less likely that you have to wait for a few seconds of churning swap-in when you continue working
: not swapping out GUI programs makes them ''feel'' faster even if they don't actually run faster
 
* When your computer has more memory than you actively use, there will be less IO caused by swapping inactive pages out and in again (but there are other factors that ''also'' make swapping less likely in such cases)
 
 
Arguments for '''higher''' swappiness seem to include{{verify}}:
* When there is low memory pressure, caches are what make (repetitive) disk access faster.
 
* keeps memory free
** spreads swap activity over time, useful when it is predictably useful later
** free memory is usable by the OS page cache
 
* swapping out rarely used pages means new applications and new allocations are served faster by RAM
: because it's less likely we have to swap other things out at allocation time
 
* allocation-greedy apps will not cause swapping so quickly, and are served more quickly themselves
 
 
 
 
'''On caches'''
 
Swappiness applies mostly to a process's memory, and not to kernel constructs like the OS [[page cache]] (and [[dentry cache]], and [[inode cache]]).
 
 
That means that swapping things out increases the amount of OS page cache we have.
 
 
From a perspective of data caching, you can see swappiness as one knob that (somewhat indirectly) controls how likely data will sit in a process, OS cache, or swapped out.
 
 
 
Consider for example the case of large databases (often following some 80/20-ish locality patterns).
 
If you can make the database cache data in its own process memory, you may want lower swappiness, since that makes it more likely that needed data is still in memory.
 
 
If you ''disable'' that in-process caching of tables, then you might get almost the same effect, because the space freed is instead left to the OS page cache, which may then store all the file data you read most - which can be entirely the same thing (if you have no other major programs on the host).
 
{{comment|(In some cases (often  mainly 'when nothing else clobbers it'), the OS page cache is a simple and great solution. Consider how a file server will automatically focus on the most common files, transparently hand it to multiple processes, etc.
 
Sure, for some cases you design something smarter, e.g. a LRU memcache.
 
And of course this cache is bad to count on when other things on the server start vying for the same cache (and clobbering it as far as you're concerned).
 
This also starts to matter when you fit a lot of different programs onto the same server so they start vying for limited memory.)}}
 
 
 
'''Server versus workstation'''
 
 
 
There is some difference between server and workstation.
 
Or rather, a system that is more or less likely to touch on the same data repeatedly,
hence value caches. A file server typically will, other servers frequently will.
 
 
Desktop tends to see relatively random disk access so cache doesn't matter much.
 
Instead, you may care to avoid GUI program swapped out much,
by having ''nothing'' swap out even when approaching memory pressure.
 
This seems like micromanaging for a very specific case (you're off as badly at actual memory pressure, and off as well when you have a lot of free RAM), but it might sometimes apply.
 
 
 
'''Actual tweaking'''
 
 
There is also:
* vm.swappiness - as discussed here
* vm.vfs_cache_pressure - how willing the kernel is to reclaim the dentry and inode caches, relative to the page cache
 
 
 
 
 
 
In linux you can use proc or sysctl to check and set swappiness
cat /proc/sys/vm/swappiness
sysctl vm.swappiness
...shows you the current swappiness (a number between 0 and 100), and you can set it with something like:
echo 60 >  /proc/sys/vm/swappiness
sysctl -w vm.swappiness=60
 
 
 
 
 
 
 
 
This is '''not''' a percentage, as some people think. It's a fudgy value, and hasn't meant the same thing for all iterations of the code behind this.
 
Some kernels do little swapping for values in the range 0-60 (or 0-80, but 60 seems the more common tipping point).
 
It seems gentler tweaking is in the 20-60 range.
 
 
A value of 100 or something near it tends to make for very aggressive swapping.
 
* 0 doesn't disable, but should be pretty rare until memory pressure (which probably makes oom_kill likelier to trigger)
 
* Close to 100 is very aggressive.
 
 
1 is enabled but very light
 
up to 10
 
 
Note that the meaning of the value was never very settled, and has changed with kernel versions {{comment|(for example, (particularly later) 2.6 kernels swap out more easily under the same values than 2.4)}}.
 
 
 
* If you swap to SSD, you might lower swappiness to make it live longer
: but memory use peaks will affect it more than swappiness
 
 




People report that
* interaction with a garbage collector (e.g. the JVM's) might lead to regular swapping
: which is, in some views, entirely unnecessary
: so argue for lower swappiness

* servers:
: 10 ''may'' make sense e.g. on database servers to focus on caches
: on a dedicated machine, if what you keep in apps may instead be in OS cache it may matter little

* desktops:
: around &le;10 starts introducing choppiness and pauses (probably because it concentrates swapping IO during allocation requests)


* VMs make things more interesting

* containers too make things more interesting


See also:
* http://lwn.net/Articles/83588/
* https://lwn.net/Articles/690079/
* https://askubuntu.com/questions/184217/why-most-people-recommend-to-reduce-swappiness-to-10-20
-->


=====oom_kill=====


<tt>oom_kill</tt> is linux kernel code that starts killing processes when there is enough memory scarcity that memory allocations cannot happen within reasonable time (because that usually means we are already [[trashing]]).




Killing processes sounds like a poor solution.

But consider that an OS can deal with completely running out of memory in roughly three ways:
* deny all memory allocations until the scarcity stops.
: This isn't very useful because
:: it will affect every program until scarcity stops
:: if the cause is one flaky program - and it usually is just one - then the scarcity may not stop
:: programs that do not actually check every memory allocation will probably crash
:: programs that do such checks well may have no option but to stop completely (maybe pause)
: So in the best case, random applications will stop doing useful things - probably crash - and in the worst case your system will crash.

* delay memory allocations until they can be satisfied
: This isn't very useful because
:: this pauses all programs that need memory (they cannot be scheduled until we can give them the memory they ask for) until scarcity stops
:: again, there is often no reason for this scarcity to stop
: so this typically means a large-scale system freeze (indistinguishable from a system crash in the practical sense of "it doesn't actually do anything")

* killing the misbehaving application to end the memory scarcity.
: This makes a bunch of assumptions that have to be true -- but it lets the system recover
:: assumes there is a single misbehaving process (not always true, e.g. two programs allocating most of RAM would be fine individually, and need an admin to configure them better)
:: ...usually the process with the most allocated memory, though oom_kill logic tries to be smarter than that.
:: assumes that the system has had enough memory for normal operation up to now, and that there is probably one haywire process (misbehaving or misconfigured, e.g. (pre-)allocates more memory than you have)
:: this could misfire on badly configured systems (e.g. multiple daemons all configured to use all RAM, or having no swap, leaving nothing to catch incidental variation)


Keep in mind that
* oom_kill is sort of a worst-case fallback
: generally
:: if you feel the need to rely on the OOM, don't
:: if you feel the wish to overcommit, don't
: oom_kill is meant to deal with pathological cases of misbehaviour
:: but even then might pick some random daemon rather than the real offender, because in some cases the real offender is hard to define
: note that you can isolate likely offenders via cgroups now (also meaning that swapping happens per cgroup)
:: and apparently oom_kill is now cgroups-aware

* oom_kill does not always save you.
: It seems that if your system is trashing heavily already, it may not be able to act fast enough.
: (and possibly go overboard once things do catch up)

* You may wish to disable oom_kill when you are developing
: ...or at least equate an oom_kill in your logs as a fatal bug in the software that caused it.

* If you don't have oom_kill, you may still be able to get a reboot instead, by setting the following sysctls:
 vm.panic_on_oom=1
: and a nonzero kernel.panic (seconds to show the message before rebooting)
 kernel.panic=10


See also
* http://mirsi.home.cern.ch/mirsi/oom_kill/index.html
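
A related practical knob: you can hint oom_kill about which processes to prefer or spare via /proc/&lt;pid&gt;/oom_score_adj (values range from -1000 to 1000; lowering it below 0 typically needs root). A minimal sketch that marks the current process as a preferred victim:

 #include <stdio.h>
 
 /* Hint the kernel's OOM killer about this process:
    oom_score_adj ranges from -1000 (avoid killing) to 1000 (kill first). */
 int main(void) {
     FILE *f = fopen("/proc/self/oom_score_adj", "w");
     if (!f) { perror("oom_score_adj"); return 1; }
     fprintf(f, "500\n");   /* mark ourselves as a preferred victim */
     fclose(f);
     /* ... memory-hungry, expendable work ... */
     return 0;
 }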


===Page faults===
<!--
{{zzz|For context:
Consider that in a system with a Virtual Memory Manager (VMM),
applications only ever deal in {{comment|(what in the grander scheme turns out to be)}} virtual addresses;
it is the VMM system's implementation that implies/does the translation to real backing storage.

So when doing any memory access, one of the tasks is checking whether this access makes sense. For example, it may be that
* the page is not known
* the page is known, but not considered accessible
* the page is known, considered accessible, and in RAM
* the page is known, considered accessible, but cannot be accessed in the lightest sort of pass-through-to-RAM way.}}


A '''page fault''', widely speaking, means "instead of direct access, the kernel needs to decide what to do now" - that last case {{comment|(some of the others have their own names)}}.

That signalling is called a page fault {{comment|(Microsoft also uses the term 'hard fault')}}.

Note that it's a ''signal'' (caught by the OS kernel), and called a 'fault' only for historical, almost-electronic-level design reasons.

A page fault can still mean one of multiple things.
Most (but not even all) ground is covered by the following cases:


'''Minor page fault''', a.k.a. '''soft page fault'''
: Page is actually in RAM, but not currently marked in the MMU page table (often due to its limited size{{verify}})
: resolved by the kernel updating the MMU with the knowledge it needs, ''then'' probably allowing the access.
:: No memory needs to be moved around.
:: little extra latency
: (can happen around shared memory, or around memory that has been unmapped from processes but there had been no cause to delete it just yet - which is one way to implement a page cache)
: other cases that amount to a minor fault include{{verify}}
:: writing to a copy-on-write memory zone - the kernel copies the page, then allows the access
:: writing to a page that was promised by an allocation but not yet backed - the kernel allocates it, then allows the access


'''Major page fault''', a.k.a. '''hard page fault'''
: memory is mapped, but not currently in RAM
: i.e. mapping on request, or loading on demand -- which is how you can do overcommit and pretend there is more memory (which is quite sensible where demand rarely happens)
: also the case for memory-mapped files: the kernel reads in the requested data, then allows the access
: resolved by the kernel finding some space, and loading that content. That takes free RAM, which can be made by swapping out another page.
:: Adds noticeable latency, namely that of your backing storage
:: the latter is sort of a "fill one hole by digging another" approach, yet this is only a real problem (trashing) when demand is higher than physical RAM


'''Invalid page fault'''
: memory isn't mapped, and there cannot be memory backing it
: resolved by the kernel raising a [[segmentation fault]] or [[bus error]] signal - sending SIGSEGV or SIGBUS, whose default handler kills the process


Note that most page faults are not errors.

In the case of [[memory mapped IO]], this is the designed behaviour.


Minor faults will often happen regularly, because they include mechanisms that are cheap, save memory, and thereby postpone major page faults.

Major faults ideally happen as little as possible, because the memory access is delayed by disk IO.

-->
See also
* http://en.wikipedia.org/wiki/Paging
* http://en.wikipedia.org/wiki/Page_fault
* http://en.wikipedia.org/wiki/Demand_paging
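
On linux you can watch your own process's fault counters via getrusage(). A minimal sketch (touching freshly allocated memory is just an easy way to cause minor faults; whether you also see major faults depends on memory pressure):

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/resource.h>
 
 static void report(const char *when) {
     struct rusage ru;
     getrusage(RUSAGE_SELF, &ru);
     printf("%s: %ld minor faults, %ld major faults\n",
            when, ru.ru_minflt, ru.ru_majflt);
 }
 
 int main(void) {
     report("at start");
 
     /* Touching freshly allocated memory typically causes minor faults:
        the pages get real backing on first use, no disk involved. */
     size_t size = 64 * 1024 * 1024;
     char *p = malloc(size);
     if (!p) return 1;
     memset(p, 1, size);
 
     report("after touching 64 MiB");
     free(p);
     return 0;
 }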


===Copy on write===




===Glossary===
