=Virtual memory=
{{stub}}
 
 
Virtual memory ended up doing a number of different things,
which for the most part can be explained separately.
 
 
===Intro===
<!--
A '''virtual memory system''' is one in which running code never deals ''directly'' with physical addresses.
 
Instead,
each task gets its own address space:
there is some sort of translation, via a lookup table, between the addresses that the OS/programs see and the physical memory they actually go to.
 
 
No matter the addresses used within each task, they can't clash in physical memory (or rather, ''won't'' overlap until the OS specifically allows it - see shared memory).
 
There are a handful of reasons this can be useful. {{comment|(Note: this is a broad-strokes introduction that simplifies and ignores a lot of historical evolution of how we got where we are and ''why'' - a bunch of which I know I don't know)}}.
 
 
The largest of these ideas is '''protected memory''': that lookup can easily say "that is not allocated to you, ''denied''", meaning a task can never accidentally access memory it doesn't own. (once upon a time any program could access any memory, but this has practical issues)
 
This is useful for stability, in that a user task can't bring down a system task accidentally. Misbehaving tasks will fail in isolation.
 
It's also great for security, in that tasks can't do it intentionally - you can't read what anyone else is doing.
 
{{comment|(Note that you can have protection ''without'' virtual addresses, if you keep track of what belongs to a task. A few embedded systems opt for this because it can be a little simpler (and a little faster) without that extra step of indirection. Yet in general you get and want both.)}}
 
 
Another reason is that processes (and most programmers) don't have to think about other tasks, the OS, or their management.  Say, in the DOS days, everything used the same memory space, so memory management was a more cooperative thing -- which is a pain, and one of various reasons you would run one thing at a time (with few exceptions).
 
 
There are other details, like
* an OS can present a uniform interface over varying hardware lookup/protection implementations, which change over time, with extensions and variations even within the same CPU architecture/family.
 
* it can make fragments of RAM look contiguous to a process, which makes life much easier for programmers, and has negligible effect on speed (because of the RA in RAM).
: generally the VMM does try to minimise fragmentation where possible, because too much can thrash the fixed-size TLB
 
* on many systems, the first page in a virtual address space is marked unreadable, which is how null pointer dereferences can be caught more easily/efficiently than on systems without MMU/MPUs (a small demonstration follows this list).
 
* In practice it matters that physical/virtual mapping is something a cache system can understand. There are other solutions that are messier.
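
To make the null-page point concrete, here is a minimal sketch (assuming a Unix-like system with an MMU): dereferencing a null pointer touches the unmapped first page, the MMU faults, and the kernel delivers SIGSEGV.

 #include <signal.h>
 #include <stddef.h>
 #include <unistd.h>
 /* Async-signal-safe handler: report and exit. */
 static void on_segv(int sig) {
     (void)sig;
     static const char msg[] = "caught SIGSEGV: the first page is not mapped\n";
     write(STDERR_FILENO, msg, sizeof msg - 1);
     _exit(1);
 }
 int main(void) {
     signal(SIGSEGV, on_segv);
     volatile int *p = NULL;  /* address 0 - deliberately left unmapped */
     return *p;               /* MMU raises a fault; kernel sends SIGSEGV */
 }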
 
 
 
 
'''Lower levels'''
 
Which bits of memory belong to which task is ''managed'' by the Virtual Memory Manager (VMM) - which is software, and part of the OS kernel.
 
 
Since memory access is so central to function, as well as to the speed of things, the actual translation between virtual and physical addresses is typically offloaded to dedicated hardware, often called the MMU (Memory Management Unit), and its Translation Lookaside Buffer (TLB) is a large part of that. {{comment|(Since the TLB has a limited size, it is essentially a cache of a fuller list managed by the VMM, which is slightly slower to access. This is why soft page faults are not errors but actually quite normal - though still something you want to minimise.)}}
 
 
'''These days'''
 
The computer you're working on right now most likely has an MMU.
 
Some systems don't virtualise but still protect, in which case it's probably called a Memory Protection Unit (MPU). This is done on some embedded platforms (e.g. some ARMs), e.g. for real-time needs, as an MMU can in some (relatively rare) cases hold you up a few thousand cycles.
 
And some have neither - in particular simpler microcontrollers, which run just one program, and any sort of multitasking is cooperative.
 
 
 
'''Mid to high levels'''
 
Once the VMM was a thing, it allowed ideas more complex than just dividing memory.
 
This includes (and is not limited to):
* overcommitting RAM and virtual memory
* swapping / paging
* memory mapped IO
* sharing libraries between processes (read-only)
* sharing memory between processes (read/write)
* sharing memory between kernel and processes
* lazy backing of allocated memory (e.g. allocate on use, copy on write)
* system cache (particularly the disk cache) that can always yield to process allocation
 
Most such ideas ended up entangled with each other, which is what makes it hard to have a complete view of what exactly memory management in a modern OS is doing.  Which is fine, in that it's not very important unless you're a kernel programmer, or maybe doing some ''very'' specific bug analysis.
 
Still, it helps to have a basic grasp on what's going on - even just to read out memory statistics.
 
 
 
Knowing some terms also helps having sensible distinctions, like that
: '''mapped''' memory often means 'anything a task can see, according to the VMM'
:: which includes things beyond what is uniquely allocated to just it
: '''committed''' memory often means it has guaranteed backing, be it in RAM or on disk
 
: Most systems have distinct names for 'free as in unused' and 'free as in used for caches but can be yielded almost immediately when you ask for it'
:: how this is reported varies between OSes, and versions. For example, it seems only recent windows makes an explicit distinction between 'Free' and 'Standby'.
 
 
 
Things like 'total used memory' are not as simple as you'd think.
 
Consider:
* shared libraries can be many processes mapping just one bit of memory (read-only{{verify}})
 
* shared memory is multiple processes mapping the same memory (read-write)
 
* [[memory mapped IO]] and [[memory mapped files]]
: which are not backed by RAM, just have the VMM pretending they're there (a small example follows this list)
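
For the memory-mapped file case, a minimal sketch (assuming Linux/Unix; /etc/hostname is just an arbitrary file that usually exists): the file's contents appear as ordinary memory, and the VMM faults pages in from disk on first access, rather than backing them with allocated RAM up front.

 #include <fcntl.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <unistd.h>
 int main(void) {
     int fd = open("/etc/hostname", O_RDONLY);
     if (fd < 0) { perror("open"); return 1; }
     struct stat st;
     if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }
     /* No read() calls: the VMM maps the file into our address space. */
     char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
     if (data == MAP_FAILED) { perror("mmap"); return 1; }
     fwrite(data, 1, st.st_size, stdout);  /* first touch faults pages in */
     munmap(data, st.st_size);
     close(fd);
     return 0;
 }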
 
 
There is a useful distinction here: private memory (only available to one task) is simple, but there is also '''shareable memory'''.
 
Shared-anything should probably be counted just once in summaries,
and e.g. memory mapped files not even once because they're just an abstraction.
 
And then there's swapping, a topic in itself.
 
And then there's overcommit, where we allow programs to ask for a little more memory
than we have storage to actively back it. Another separate topic.
 
-->
 
 
===Overcommitting RAM with disk: Swapping / paging; thrashing===
 
<!--
{{comment|(Note that what windows usually calls paging, unixen usually call swapping. In broad descriptions you can treat them the same. Once you get into the details the workings and terms do vary, and precise use becomes more important.)}}
 
 
Swapping/paging is, roughly, the idea that the VMM can have a pool of virtual memory that comes from RAM ''and'' disk.
 
This means you can allocate more total memory than would fit in RAM at the same time. {{comment|(It can be considered overcommit of RAM, though note this is ''not'' the usual/only meaning of the term overcommit, see below)}}.
 
The VMM decides which parts go from RAM to disk, when, and how much of such disk-based memory there is.
 
 
Using disk for memory seems like a bad idea, as disks are significantly slower than RAM in both bandwidth and latency.
 
Which is why the VMM will always prefer to use RAM.
 
 
There are a few reasons it can make sense:
 
* there is often some percentage of each program's memory that is inactive
: think "was copied in when starting, and then never accessed in days, and possibly never will be"
: if the VMM can adaptively move that to disk (where it will still be available if requested, just slower), that frees up more RAM for ''active'' programs (or caches) to use.
: not doing this means a percentage of RAM would always be entirely inactive
: doing this means slow access whenever you ''did'' need that memory after all
: this means a small amount of swap space is almost always beneficial
: it also doesn't make that much of a difference, because most things that are allocated have a purpose. '''However'''...
 
 
* there are programs that blanket-allocate RAM, and will never access part/most of it even once.
: as such, the VMM can choose to not back allocations with anything until the first use (a small demonstration follows this list).
:: this mostly just saves some up-front work
: separately, there is a choice of how to count this not-yet-used memory
:: you could choose to not count that memory at all - but that's risky, and vague
:: usually it counts towards the ''swap/page'' area  (often without ''any'' IO there)
: this means a ''bunch'' of swap space can be beneficial, even if just for bookkeeping without ever writing to it
:: just to not have to give out memory that is never used
:: while still actually having backing if it ever is.
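
A rough way to watch this allocate-on-first-use behaviour (a sketch assuming Linux; it reads the resident set size from /proc/self/statm, and exact numbers will vary):

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 /* Resident set size in pages: second field of /proc/self/statm. */
 static long rss_pages(void) {
     long size = 0, rss = -1;
     FILE *f = fopen("/proc/self/statm", "r");
     if (f) { fscanf(f, "%ld %ld", &size, &rss); fclose(f); }
     return rss;
 }
 int main(void) {
     size_t len = 256UL * 1024 * 1024;  /* 256 MB */
     printf("RSS before malloc:  %ld pages\n", rss_pages());
     char *p = malloc(len);
     if (!p) { perror("malloc"); return 1; }
     printf("RSS after malloc:   %ld pages\n", rss_pages()); /* barely moved */
     memset(p, 1, len);  /* first use - only now does the VMM back the pages */
     printf("RSS after touching: %ld pages\n", rss_pages()); /* grew a lot */
     free(p);
     return 0;
 }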
 
 
And yes, in theory neither would be necessary if programs behaved with perfect knowledge of other programs, of the system, and of how their data gets used.
 
In practice this usually isn't feasible, so it makes sense to do this at OS-level basically with a best-guess implementation.
 
In most cases it has a mild net positive effect, largely because both above reasons mean there's a little more RAM for active use.
 
 
Yes, it is ''partly'' circular reasoning, in that programmers now get lazy doing such bulk allocations knowing this implementation, thereby ''requiring'' such an implementation.
Doing it this way has become the most feasible because we've gotten used to thinking this way about memory.
 
 
 
Note that neither reason should impact the memory that programs actively use.
 
Moving inactive memory to disk will also rarely slow ''them'' down.
Things that periodically-but-very-infrequently do a thing may need up to a few extra seconds.
 
 
There is some tweaking to such systems
* you can usually ask for RAM that is never swapped/paged. This is important if you need to guarantee it's always accessed within a very low timespan (can e.g. matter for real-time music production based on samples) - see the sketch after this list
 
* you can often tweak how pre-emptive swapping is
: To avoid having to swap things to disk during the next memory allocation request, it's useful to do so pre-emptively, when the system isn't too busy.
: this is usually somewhat tweakable
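
For that never-swapped case, a minimal sketch (assuming a Unix-like system; note that unprivileged processes are capped by RLIMIT_MEMLOCK, so this can fail with default limits):

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
 int main(void) {
     size_t len = 1UL << 20;  /* 1 MB */
     char *buf = malloc(len);
     if (!buf) { perror("malloc"); return 1; }
     memset(buf, 0, len);  /* make sure the pages are backed at all */
     /* Pin these pages in RAM: the VMM may never swap/page them out. */
     if (mlock(buf, len) != 0) {
         perror("mlock");  /* often ENOMEM/EPERM due to RLIMIT_MEMLOCK */
         return 1;
     }
     printf("1 MB locked in RAM; access latency never involves swap\n");
     munlock(buf, len);
     free(buf);
     return 0;
 }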
 
 
 
 
'''Using more RAM than you have'''
 
The above raises the question of what happens when you attempt to actively use more RAM than you have.
 
This is a problem with ''and'' without swapping, with and without overcommit.
 
 
Being out of memory is a pretty large issue. Even the simplest "use what you have, deny once that's gone" system would have to just deny allocations to programs.
 
Many programs don't check every allocation, and may crash if not actually given what they ask for.  But even if they handled denied allocations elegantly, in many cases the perfect behaviour still amounts to stopping the program.
 
 
Either way, the computer is no longer able to do what you ask of it.
 
And there is an argument that it is preferable to have it continue, however slowly,
in the hope this was some intermittent bad behaviour that will be solved soon.
 
 
When you overcommit RAM with disk, this happens somewhat automatically.
And it's slow as molasses, because some of the actively used memory is now going not via microsecond-at-worst RAM but millisecond-at-best disk.
 
 
There are cases that are less bad, but the defining problem is doing this ''continuously'' instead of sporadically.
 
This is called '''thrashing'''. If your computer suddenly started to continuously rattle its disk while being very slow, this is what happened.
 
 
{{comment|(This is also the number one reason why adding RAM may help a lot for a given use -- or not at all, if this was not your problem.)}}

-->
 
 
===Overcommitting (or undercommitting) virtual memory, and other tricks===
<!--
 
Consider we have a VMM system with swapping, i.e.
* all of the actively used virtual memory pages are in RAM
* infrequently used virtual memory pages are on swap
* never-used pages are counted towards swap {{comment|(does ''not'' affect the amount of allocation you can do in total)}}
 
Overcommit is a system where the last point can instead be:
* never-used pages are nowhere.
 
 
'''More technically'''
 
More technically, overcommit allows allocation of address space, without allocating memory to back it.
 
Windows makes you do both of those explicitly,
implying fairly straightforward bookkeeping,
and that you cannot do this type of overcommit.
{{comment|(note: now less true due to compressed memory{{verify}})}}
 
 
Linux implicitly  allows that separation,
basically because the kernel backs allocations only on first use {{comment|(which is also why some programs will ensure they are backed by something by storing something to all memory they allocate)}}.
 
Which is separate from overcommit; if overcommit is disabled, it merely saves some initialisation work.
But with overcommit (and similar tricks, like OSX's and Win10's compressed memory, or linux's [[zswap]]) your bookkeeping becomes more flexible.
 
Which includes the option to give out more than you have.
 
 
'''Why it can be useful'''
 
Basically, when there may be a good reason pages will ''never'' be used.
 
The difference is that without overcommit this still all needs to count towards something (swap, in practice), while overcommit means the bookkeeping assumes there will always be a little such used-in-theory-but-never-in-practice allocation.
 
How wise that is depends on use.
 
 
There are two typical examples.
 
 
One is that a large process may fork().
In a simple implementation you would need twice the memory,
but in practice the two forks' pages are copy-on-write, meaning they will be shared until written to.
Meaning you still need to do bookkeeping in case that happens, but even if the fork is another worker, it probably won't come to twice the memory.
 
In the specific case where it wasn't for another copy of that program, but to instead immediately exec() a small helper program, that means the pages will ''never'' be written.
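
A sketch of that fork-then-exec pattern (assuming a Unix-like system; /bin/true stands in for any small helper): the child shares the parent's pages copy-on-write for the brief moment before exec(), so the large allocation is never actually duplicated - yet without overcommit, the kernel would still have to reserve a full copy's worth of backing for that moment.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/wait.h>
 #include <unistd.h>
 int main(void) {
     size_t len = 512UL * 1024 * 1024;  /* pretend to be a large process */
     char *big = malloc(len);
     if (!big) { perror("malloc"); return 1; }
     memset(big, 1, len);  /* actually backed, so fork() must account for it */
     pid_t pid = fork();   /* child shares all pages copy-on-write */
     if (pid == 0) {
         /* Child execs immediately: the 512 MB is never written to,
            so the copy-on-write reservation was never needed. */
         execl("/bin/true", "true", (char *)NULL);
         _exit(127);  /* only reached if exec failed */
     }
     waitpid(pid, NULL, 0);
     free(big);
     return 0;
 }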
 
 
The other I've seen is mentioned in the kernel docs: scientific computing that has very large, very sparse arrays.
This is essentially said computing avoiding having to write its own clever allocator, by relying on the linux VMM instead.
 
 
 
Most other examples arguably fall under "users/admins not thinking enough".
Consider the JVM, which has its own allocator which you give an initial and max memory figure at startup.
Since it allocates memory on demand (also initialises it{{verify}}), the user may ''effectively'' overcommit by having the collective -Xmx be more than RAM.
That's not really on the system to solve, that's just bad setup.
 
 
 
 
'''Critical view'''
 
Arguably, having enough swap makes this type of overcommit largely unnecessary, and mainly just risky.
 
The risk isn't too large, because it's paired with heuristics that disallow silly allocations,
and the oom_killer that resolves most runaway processes fast enough.
 
 
 
It's like [https://en.wikipedia.org/wiki/Overselling#Airlines overselling aircraft seats], or [https://en.wikipedia.org/wiki/Fractional-reserve_banking fractional reserve banking].
 
It's a promise that is ''less'' of a promise, it's fine (roughly for the same reasons that systems that allow swapping are not continuously thrashing), but once your end users count on this, the concept goes funny, and when everyone comes to claim what's theirs you are still screwed.
 
 
Note that windows avoids the fork() case by not having fork() at all (there's no such cheap process duplication, and in the end almost nobody cares).
 
 
Counterarguments to overcommit include that system stability should not be based on bets,
that it is (ab)use of an optimization that you should not be counting on,
that programs should not be so lazy,
and that we are actively enabling them to be lazy and behave less predictably,
and now sysadmins have to frequently figure out why that [[#oom_kill|oom_kill]] happened.
 
 
Yet it is harder to argue that overcommit makes things less stable.
 
Consider that without overcommit, memory denials are more common (and that typically means apps crashing).
 
With or without overcommit, we are currently already asking what the system's emergency response should be (and there is no obvious answer to "what do we sacrifice first") because improper app behaviour is ''already a given''.
 
 
Arguably oom_kill ''can'' be smarter, in that it usually kills only an actually misbehaving program -
whereas a denial probably hits whichever program happens to allocate next (more random).
 
But you don't gain much reliability either way.
 
{{comment|(In practice oom_kill can take some tweaking, because it's still possible that e.g. a mass of smaller
programs lead to the "fix" of your big database getting killed)}}
 
 
 
'''So is it better to disable it?'''
 
No, it has its beneficial cases, even if they are not central.
 
Disabling also won't prevent swapping or trashing,
as the commit limit is typically still > RAM {{comment|(by design, and you want that. Different discussion though)}}.
 
But apps shouldn't count on overcommit as a feature, unless you ''really'' know what you're doing.
 
Note that if you want to keep things in RAM, you probably want to lower [[#swappiness|swappiness]] instead.
 
 
 
 
'''Should I tweak it?'''
 
Possibly.
 
Linux has three modes:
* overcommit_memory=2: No overcommit
: userspace commit limit is swap + fraction of ram
: if that's &lt;RAM, the rest is only usable by the kernel, usually mainly for caches (which can be a useful mechanism to dedicate some RAM to the [[page cache]])
 
* overcommit_memory=1: Overcommit without checks/limits.
: Appropriate for relatively few cases, e.g. the very-sparse-array example.
: in general just more likely to swap and OOM.
 
* overcommit_memory=0: Overcommit with heuristic checks (default)
: refuses large overcommits, allows the sort that would probably reduce swap usage
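
A quick way to see the modes differ (a sketch, assuming 64-bit Linux): ask for far more memory than the machine plausibly has, without touching it. Mode 1 grants it, mode 2 denies it once past the commit limit, and mode 0's heuristic decides.

 #include <stdio.h>
 #include <stdlib.h>
 int main(void) {
     /* 64 GB of address space, never touched. */
     size_t len = 64UL * 1024 * 1024 * 1024;
     void *p = malloc(len);
     if (p)
         printf("granted 64 GB it probably cannot back - overcommit at work\n");
     else
         printf("denied - commit accounting said no\n");
     free(p);  /* free(NULL) is harmless if the allocation was denied */
     return 0;
 }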
 
 
 
These mainly control the maximum allocation limit for userspace programs.
This is still a fixed number, and still ''related'' to the amount of RAM, but the relation can be more interesting.
 
On windows it's plainly what you have:
swap space + RAM
 
and on linux it's:
swap space + (RAM * (overcommit_ratio/100) )
or, if you instead use overcommit_kbytes,
swap space + overcommit_kbytes {{verify}}
 
 
Also note that 'commit_ratio' might have been a better name,
because it's entirely possible to have that come out as ''less'' than RAM - undercommit, if you will.
 
This undercommit is also a sort of feature, because while that keeps applications from using it,
this ''effectively'' means it's dedicated to (mainly) kernel cache and buffers.
 
 
 
Note that the commit limit is ''how much'' it can allocate, not where it allocates from (some people assume this based on how it's calculated).
 
Yet if the commit limit is less than total RAM, applications will never be able to use all RAM.
This may happen when you have a lot of RAM and/or very little swap.
 
 
Because when you use overcommit_ratio (default is 50), the value (and sensibility) of the commit limit essentially depends on the ''ratio'' between swap space and RAM.
 
Say,
: 2GB swap, 4GB RAM, overcommit_ratio=50: commit limit at (2+0.5*4) = 4GB.
: 2GB swap, 16GB RAM overcommit_ratio=50: (2+0.5*16) = 10GB.
: 2GB swap, 256GB RAM overcommit_ratio=50: (2+0.5*256) = 130GB.
 
: 30GB swap, 4GB RAM overcommit_ratio=50: (30+0.5*4) = 32GB.
: 30GB swap, 16GB RAM overcommit_ratio=50: (30+0.5*16) = 38GB.
: 30GB swap, 256GB RAM overcommit_ratio=50: (30+0.5*256) = 158GB.
 
 
So
* (more so if you have a lot of RAM) you may consider setting overcommit_ratio higher than the default
: possibly close to 100% {{comment|(or use overcommit_kbytes instead because that's how you may be calculating it anyway)}}
: and/or add more swap space.
 
* if you desire to leave some dedicated to caches (which is a good idea) you have to do some arithmetic.
: For example, with 4GB swap and 48GB RAM,
:: you need ((48-4)/48=) ~91% to cover RAM,
:: and ((48-4-2)/48=) ~87% to leave ~2GB for caches.
 
* this default is why people suggest your swap area should be roughly as large as your RAM (same order of magnitude, anyway)
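
You can check the kernel's own version of this arithmetic in /proc/meminfo (a sketch assuming Linux): CommitLimit is the computed limit, Committed_AS is how much has been promised so far.

 #include <stdio.h>
 #include <string.h>
 int main(void) {
     FILE *f = fopen("/proc/meminfo", "r");
     if (!f) { perror("fopen"); return 1; }
     char line[256];
     while (fgets(line, sizeof line, f)) {
         /* CommitLimit = swap + RAM * overcommit_ratio/100 (mode 2's cap) */
         if (strncmp(line, "CommitLimit:", 12) == 0 ||
             strncmp(line, "Committed_AS:", 13) == 0)
             fputs(line, stdout);
     }
     fclose(f);
     return 0;
 }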
 
 
 
 
 
 
 
'''Should I add more RAM instead?'''
 
Possibly. It depends on your typical and peak load.
 
More RAM improves performance noticeably only when it avoids swapping under typical load.
 
Beyond that it helps only insofar as files you read get cached (see [[page cache]]);
otherwise it has ''no'' effect.
 
 
 
 
 
 
 
 
 
Other notes:
* windows is actually more aggressive about swapping things out - it seems to do so in favour of IO caches
* linux is more tweakable (see [[#swappiness|swappiness]]) and by default is less aggressive.
 
 
* overcommit makes sense if you have significant memory you reserve but ''never'' use
: which is, in some views, entirely unnecessary
: it should probably be seen as a minor optimization, and not a feature you should (ab)use
 
 
 
Unsorted notes
* windows puts more importance on the swap file
 
* you don't really want to go without swap file/space on either windows or linux
: (more so if you turn overcommit off on linux)
 
* look again at that linux equation. That's ''not'' "swap plus more-than-100%-of-RAM"
: and note that if you have very little swap and/or tons of RAM (think >100GB), it can mean your commit limit is lower than RAM
 
* swap will not avoid oom_kill altogether - oom_kill is triggered on low speed of freeing pages {{verify}}
 
 
 
 
 
-->
 
<!--
See also:
* https://serverfault.com/questions/362589/effects-of-configuring-vm-overcommit-memory
 
* https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
 
* https://www.win.tue.nl/~aeb/linux/lk/lk-9.html
 
* http://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/
 
 
-->
 
===Page faults===
 
<!--
Consider that in a system with a VMM, applications only ever deal in virtual addresses; it is the VMM that handles the translation to real backing storage.
 
 
When doing a memory access, one of the possibilities is that the access makes sense (the page is known, and considered accessible) but cannot be served in the lightest sort of pass-through-to-RAM way.
 
A page fault, widely speaking, means "instead of direct access, the kernel needs to decide what to do now".
 
That signalling is called a page fault {{comment|(Microsoft calls major page faults 'hard faults')}}.
 
Note that it's a ''signal'', caught by the OS kernel. It's called 'fault' only for historical low-level design reasons.
 
 
This can mean one of multiple things. Most of said ground is covered by the following cases:
 
 
'''Minor page fault''', a.k.a. '''soft page fault'''
: Page is actually in RAM, but not currently marked in the MMU page table (often due to its limited size{{verify}})
: resolved by the kernel updating the MMU, ''then'' just allowing the access.
:: No memory needs to be moved around.
:: Very little extra latency
: (can happen around shared memory, or around memory that has been unmapped from processes but there had been no cause to delete it just yet - which is one way to implement a page cache)
 
 
'''Major page fault''', a.k.a. '''hard page fault'''
: memory is mapped, but not currently in RAM
: i.e. mapping on request, or loading on demand -- which is how you can do overcommit and pretend there is more memory (which is quite sensible where demand rarely happens)
: resolved by the kernel finding free RAM (which can be made by swapping out another page first), and loading that content.
:: Adds noticeable latency, namely that of your backing storage
:: the latter is sort of a "fill one hole by digging another" approach, yet this is only a real problem (thrashing) when demand is higher than physical RAM
 
 
'''Invalid page fault'''
* memory isn't mapped, and there cannot be memory backing it
: resolved by the kernel raising a [[segmentation fault]] or [[bus error]] signal, which terminates the process
 
 
 
A page fault can also be raised because the page is not currently in main memory (often meaning swapped to disk), or because it does not currently have backing memory mapped{{verify}}.

Depending on the case, these are typically resolved either by
* mapping the region and loading the content
: which makes that specific memory access significantly slower than usual, but is otherwise fine

* terminating the process
: when the kernel failed to be able to actually fetch it

 
 
Reasons and responses include:
* '''minor page fault''' seems to include:{{verify}}
** MMU was not aware that the page was accessible - kernel informs it that it is, then allows access
** writing to a copy-on-write memory zone - kernel copies the page, then allows access
** writing to a page that was promised by the allocator but not yet backed - kernel allocates it, then allows access
 
* mapped file - kernel reads in the requested data, then allows access
 
* '''major page fault''' refers to:
** swapped out - kernel swaps it back in, then allows access
 
* '''invalid page fault''' is basically
** a [[segmentation fault]] - send SIGSEGV (the default SIGSEGV handler kills the process)
 
 
Note that most are not errors.
 
In the case of [[memory mapped IO]], this is the designed behaviour.
 
 
Minor faults will often happen regularly, because they include mechanisms that are cheap, save memory, and thereby postpone major page faults.

Major faults ideally happen as little as possible, because the memory access is delayed by disk IO.
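
You can watch your own fault counters with getrusage() (a sketch assuming a Unix-like system; ru_minflt and ru_majflt are the minor and major counts):

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/resource.h>
 static void report(const char *when) {
     struct rusage ru;
     getrusage(RUSAGE_SELF, &ru);
     printf("%-15s minor=%ld major=%ld\n", when, ru.ru_minflt, ru.ru_majflt);
 }
 int main(void) {
     report("at start:");
     size_t len = 64UL * 1024 * 1024;
     char *p = malloc(len);
     if (!p) { perror("malloc"); return 1; }
     report("after malloc:");   /* little change - nothing touched yet */
     memset(p, 1, len);         /* first touch: about one minor fault per page */
     report("after touching:");
     free(p);
     return 0;
 }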
 
-->
 
See also
* http://en.wikipedia.org/wiki/Paging
 
* http://en.wikipedia.org/wiki/Page_fault
* http://en.wikipedia.org/wiki/Demand_paging
 
===Swappiness===
{{stub}}
 
<!--
The aggressiveness with which an OS swaps out allocated-but-inactive pages to disk is often controllable.

Linux dubs this ''swappiness''. Higher swappiness means the tendency to swap out is higher. {{comment|(other information is used too, including the currently mapped ratio, and a measure of how much trouble the kernel has recently had freeing up memory)}}{{verify}}
 
 
 
Swapping out is always done with cost/benefit considerations.
 
The cost is mainly the time spent,
the benefit is mainly giving RAM to caches, and to programs (then also doing some swapping now rather than later).

(note that linux swaps less aggressively than windows to start with, at least with default settings)
 
 
There are always pages that are inactive simply because programs very rarely use them (80/20-like access patterns).

But with plenty of free RAM it might not even swap ''those'' out, because the benefit is so low.
I had 48GB and 256GB workstations at work and people rarely got them to swap ''anything''.
 
 
 
It's a sliding scale. To illustrate this point, consider the difference between:
 
* using more RAM than we have - we will probably swap in response to every allocation
: or worse, in the case of thrashing: we are swapping purely to avoid crashing the system
: Under high memory strain, cost of ''everything'' is high, because we're not swapping to free RAM for easier future use, we're swapping to not crash the system.
 
 
* Swapping at any other time is mainly about pro-actively freeing up RAM for near-future use.
: this is IO we would otherwise have to concentrate into the next large allocation request
:: arguing for ''higher'' swappiness, because it effectively spreads that work over time
 
 
These are entirely different cases.
* The former clobbers caches, the latter builds them up
 
* the former ''is'' memory strain, the latter ''may'' lessen it in the future
: (if the peak use is still sensible, and won't thrash itself along with everything else)
 
 
 
Arguments for '''lower''' swappiness:
* Delays putting things on slower disk until RAM is necessary for something else
** ...avoiding IO (also lets drives spin down, which can matter to laptop users)
** (on the flipside, when you want to allocate memory ''and'' the system needs to swap out things first to provide that memory, it means more work, IO, and sluggishness concentrated at that time)
 
* apps are more likely to stay in memory (particularly larger ones). Over-aggressive swapout (e.g. inactivity because you went for coffee) is less likely, meaning it is slightly less likely that you have to wait for a few seconds of churning swap-in when you continue working
: not swapping out GUI programs makes them ''feel'' faster even if they technically aren't
 
* When your computer has more memory than you actively use, there will be less IO caused by swapping inactive pages out and in again (but there are other factors that ''also'' make swapping less likely in such cases)
 
 
Arguments for '''higher''' swappiness seem to include{{verify}}:
* When there is low memory pressure, caches are what make (repetitive) disk access faster.
 
* keeps memory free
** spreads swap activity over time, useful when it is predictably useful later
** free memory is usable by the OS page cache
 
* swapping out rarely used pages means new applications and new allocations are served faster by RAM
: because it's less likely we have to swap other things out at allocation time
 
* allocation-greedy apps will not cause swapping so quickly, and are served more quickly themselves
 
 
 
 
'''On caches'''
 
Swappiness applies mostly to processes' memory, and not to kernel constructs like the OS page cache, dentry cache, and inode cache.
 
 
That means that swapping things out increases the amount of OS page cache we have.
 
 
From a perspective of data caching, you can see swappiness as one knob that (somewhat indirectly) controls how likely data will sit in a process, OS cache, or swapped out.
 
 
 
Consider for example the case of large databases (often following some 80/20-ish locality patterns).
 
If you can make the database cache data in its own process memory, you may want lower swappiness, since that makes it more likely that needed data is still in memory.
 
 
If you ''disable'' that in-process caching of tables, you might get almost the same effect, because the space freed is instead left to the OS page cache, which may then store all the file data you read most - which can be entirely the same thing (if you have no other major programs on the host).
 
{{comment|(In some cases (often mainly 'when nothing else clobbers it'), the OS page cache is a simple and great solution. Consider how a file server will automatically focus on the most common files, transparently hand them to multiple processes, etc.

Sure, for some cases you design something smarter, e.g. an LRU memcache.

And of course this cache is bad to count on when other things on the server start vying for the same cache (and clobbering it as far as you're concerned).)}}

This also starts to matter when you fit a lot of different programs onto the same server, so they start vying for limited memory.
 
 
 
'''Server versus workstation'''
 
 
 
There is some difference between server and workstation.
 
Or rather, between systems that are more or less likely to touch the same data repeatedly,
and hence value caches. A file server typically will, other servers frequently will.
 
 
Desktops tend to see relatively random disk access, so cache doesn't matter as much.
 
Instead, you may care to avoid GUI programs being swapped out,
by having ''nothing'' swap out even when approaching memory pressure.
 
This seems like micromanaging for a very specific case (you're just as badly off at actual memory pressure, and it makes no difference when you have a lot of free RAM), but it might sometimes apply.
 
 
 
'''Actual tweaking'''
 
 
There is also:
* vm.swappiness - how eagerly process memory is swapped out
* vm.vfs_cache_pressure - how eagerly the kernel reclaims memory used for the dentry and inode caches
 
 
 
 
 
 
In linux you can use proc or sysctl to check and set swappiness
cat /proc/sys/vm/swappiness
sysctl vm.swappiness
...shows you the current swappiness (a number between 0 and 100), and you can set it with something like:
echo 60 >  /proc/sys/vm/swappiness
sysctl -w vm.swappiness=60
 
 
 
 
 
 
 
 
This is '''not''' a percentage, as some people think. It's a fudgy value, and hasn't meant the same thing for all iterations of the code behind this.
 
Some kernels do little swapping for values in the range 0-60 (or 0-80, but 60 seems the more common tipping point).
 
It seems gentler tweaking is in the 20-60 range.
 
 
* 0 doesn't disable swapping, but makes it pretty rare until memory pressure (which probably makes oom_kill likelier to trigger)

* 1 up to roughly 10 means swapping is enabled but very light

* a value of 100 or anything near it tends to make for very aggressive swapping
 
 
Note that the meaning of the value was never very settled, and has changed with kernel versions {{comment|(for example, (particularly later) 2.6 kernels swap out more easily under the same values than 2.4)}}.
 
 
 
* If you swap to SSD, you might lower swappiness to make it live longer
: but memory use peaks will affect it more than swappiness
 
 
 
 
 
People report that
* interaction with a garbage collector (e.g. JVM's) might lead to regular swapping
: so argue for lower swappiness
 
* servers:
: 10 ''may'' make sense e.g. on database servers to focus on caches
: on a dedicated machine, if what you keep in apps may instead be in OS cache it may matter little
 
* desktops:
: around &le;10 starts introducing choppiness and pauses (probably because it concentrates swapping IO to during allocation requests)
 
 
* VMs make things more interesting
 
* containers too make things more interesting
 
 
 
 
See also:
* http://lwn.net/Articles/83588/
 
* https://lwn.net/Articles/690079/
 
* https://askubuntu.com/questions/184217/why-most-people-recommend-to-reduce-swappiness-to-10-20
 
-->
 
===Practical notes===
 
====Linux====
<!--
 
It seems that *nix swapping logic is smart enough to do basic RAID-like spreading among its swap devices, meaning that a swap partition on every disk that isn't actively used (e.g. by something important like a database) is probably useful.
 
 
Swap used for hibernation can only come from a swap partition, not a swap file {{comment|(largely because that depends too much on whether the underlying filesystem is mounted)}}.
 
 
Linux allows overcommit, but real world cases vary.
It depends on three things:
* swap space
* RAM size
* overcommit_ratio (defaults to 50%)
 
When swap space is small relative to RAM - say, on servers/workstations with dozens of GBs of RAM and only a few GB of swap -
this will easily mean overcommit_ratio should be 80-90 for userspace to be able to use most/all RAM.
 
If the commit limit is lower than RAM, the rest goes (mostly) to caches and buffers.
Which, note, is often useful - sometimes it can even be preferable to effectively have a little dedicated cache.
 
-->
 
===="How large should my page/swap space be?"====
 
<!--
Depends on use.
 
Generally, the better answer is to consider:
* your active workload, what that may require of RAM in the worst case
: some things are fixed and low (lots of programs)
: some can scale up depending on use (think of editing a high res photo)
: some can make use of anything they get (caches, databases)
: some languages have their own allocators, which may pre-allocate a few GB but may never use it
: some programs are eager to allocate and less eager to clean up / garbage collect
 
* how much is inactive, so can be swapped out
: less than a GB in most cases, and a few GB in a few
 
* too much swap doesn't really hurt
 
 
This argues that throwing a few GB at it is usually more than enough,
maybe a dozen or two GB when you have hungry/careless programs.
 
 
Servers are sometimes a little different.
 
The numbers tend to be bigger - but ideally also more predictable.
 
And tweaking can make more sense because of it.
For example, when two major services try to each use 70% of RAM for caches,
they'll end up pushing each other to swap (and both incur disk latency),
and you're better off halving the size of each,
when that means never involving disk.
 
 
Additionally:
* on linux, hibernation reserves space in swap
: so if you use this, you need to add RAM-size to the above
: doesn't apply to windows, it puts hibernation in a separate preallocated file
 
* on windows, a crash dump (when set to dump all RAM) needs the page file to be at least RAM-sized [https://support.microsoft.com/en-us/help/2860880/how-to-determine-the-appropriate-page-file-size-for-64-bit-versions-of]
: so if you use this, you need to add RAM-size to the above
 
 
 
"Your swap file needs to be 1.5x RAM size"
 
This is arbitrary.
 
As shown by that number changing over time.
I've even seen tables that also vary that factor significantly with RAM.
 
 
On basic PC use, at least 1GB is a good idea.
 
If you have a lot of RAM, you probably did that because you have memory hungry programs, and a few GB more is probably useful.
 
When RAM sizes like 2 or 4GB were typical (you know, ~2000 or so), this amounted to the same thing as 1.5x.
 
But these days, 1.5x means larger and larger amounts that will probably never be used.  Which is not in the way, and often not ''that'' much of a bite out of storage. It's not harmful, it's just pointless.
 
 
 
"Too large a page file slows down your system"
 
I don't see how it could.
 
 
The only indirect way I can think of is paging behaviour that becomes more aggressive based not on actual use but on how much of each you have. But as far as I know that's not how it works.
 
Or perhaps if your page file fills your disk space to nearly full, and contributes to fragmentation of regular files.
And even that is less relevant now that SSDs are getting typical.
 
 
 
"Page file of zero is faster because it keeps everything in memory"
 
True-ish, but not enough to matter in most cases.
 
That is, most accesses, of almost all active programs,
will not be any faster or slower, because most active use comes from RAM with or without swap enabled.
 
 
The difference is that
* the rarely-accessed stuff stays in RAM and will not be slower.
 
* you get less usable RAM, because it now holds everything that is never accessed.
: ...this reduction is sometimes significant, depending on the specific programs you run, and how much they count on inactive-swap behaviour.
 
* if swap means it's going to push things into it instead of deny allocation, then your system is more likely to recover eventually (see oom_kill), and not stop outright.
: this argues for maybe an extra RAM-size, because some programs are just that greedy
 
 
-->
 
===On memory scarcity===
 
<!--
On a RAM-only system you will at some point find you cannot get free pages.
 
 
When you've added swap and similar features,
you may find your bookkeeping says it can be done,
but in practice it will happen very slowly.
 
 
Also, having disconnected programs from the backing store,
only the kernel can even guess at how bad that is.
 
 
The most obvious case is more pages being actively used than there is physical RAM (can happen without overcommit, more easily with), but there are others. Apparently things like hot database backups may create so many [[dirty pages]] so quickly that the kernel decides it can't free anywhere near fast enough.
 
 
 
In a few cases it's due to a sudden (reasonable) influx of dirty pages, but otherwise transient.
 
But in most cases scarcity is more permanent, and means we've started swapping and probably [[thrashing]], making everything slow.
 
Such scarcity ''usually'' comes from a single careless / runaway process,
sometimes just a badly configured one (e.g. you told more than one that they could take 80% of RAM), sometimes from a slew of (probably-related) programs.
 
-->
 
 
 
=====oom_kill=====
 
<tt>oom_kill</tt> is linux kernel code that starts killing processes when there is enough memory scarcity that memory allocations cannot happen within reasonable time - as this is a good indication that it's gotten to the point that we are thrashing.
 
 
Killing processes sounds like a poor solution.
 
But consider that an OS can deal with completely running out of memory in roughly three ways:
* '''deny ''all'' memory allocations until the scarcity stops.'''
: This isn't very useful because
:: it will affect ''every'' program until scarcity stops
:: if the cause is one flaky program - and it usually is just one - then the scarcity may not stop
:: programs that do not actually check every memory allocation will probably crash.
:: programs that ''do'' such checks well may have no option but to stop completely (maybe pause)
: So in the best case, random applications will stop doing useful things - probably crash, and in the worst case your system will crash.
 
* '''delay memory allocations''' until they can be satisfied
: This isn't very useful because
:: this pauses all programs that need memory (they cannot be scheduled until we can give them the memory they ask for) until scarcity stops
:: again, there is often no reason for this scarcity to stop
: so typically means a large-scale system freeze (indistinguishable from a system crash in the practical sense of "it doesn't actually ''do'' anything")
 
* '''killing the misbehaving application''' to end the memory scarcity.
: This makes a bunch of assumptions that have to be true -- but it lets the system recover
:: assumes there ''is'' a single misbehaving process {{comment|(not always true, e.g. two programs allocating most of RAM would be fine individually, and needs an admin to configure them better)}}
::: ...usually the process with the most allocated memory, though <tt>oom_kill</tt> logic tries to be smarter than that.
:: assumes that the system has had enough memory for normal operation up to now, and that there is probably ''one'' haywire process (misbehaving or misconfigured, e.g. (pre-)allocates more memory than you have)
:: this ''could'' misfire on badly configured systems (e.g. multiple daemons all configured to use all RAM, or having no swap, leaving nothing to catch incidental variation)
 
 
 
Keep in mind that
 
* oom_kill is sort of a worst-case fallback
: generally
:: if you feel the need to rely on the OOM, '''don't.'''
:: if you feel the wish to overcommit, don't
: oom_kill is meant to deal with pathological cases of misbehaviour
:: but even then might pick some random daemon rather than the real offender, because in some cases the real offender is hard to define
: Tweak likely offenders, tweak your system.
: note that you can isolate likely offenders via [[cgroups]] now.
:: and apparently oom_kill is now cgroups-aware
 
* oom_kill does not always save you.
: It seems that if your system is [[thrashing]] heavily already, it may not be ''able'' to act fast enough.
: (and possibly go overboard once things do catch up)
 
* You may wish to disable oom_kill when you are developing
: ...or at least treat an oom_kill in your logs as a fatal bug in the software that caused it.
 
* If you don't have oom_kill, you may still be able to get reboot instead, by setting the following sysctls:
vm.panic_on_oom=1
and a nonzero kernel.panic (seconds to show the message before rebooting)
kernel.panic=10
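
One concrete way to 'tweak likely offenders' (a sketch assuming Linux): each process has an oom_score_adj in /proc, ranging from -1000 (never kill; lowering it needs privileges) to 1000 (kill first). Marking expendable processes as preferred victims can protect your important daemons.

 #include <stdio.h>
 int main(void) {
     /* Volunteer the current process as an early oom_kill victim. */
     FILE *f = fopen("/proc/self/oom_score_adj", "w");
     if (!f) { perror("fopen"); return 1; }
     fprintf(f, "500\n");  /* raising the score needs no privileges */
     fclose(f);
     printf("oom_score_adj=500: killed earlier under memory scarcity\n");
     return 0;
 }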
 
 
 
See also
* http://mirsi.home.cern.ch/mirsi/oom_kill/index.html
 
 
 
<!--
=====SLUB: Unable to allocate memory on node=====
 
SLUB is [[slab allocation]], i.e. about dynamic allocation of kernel memory
 
 
This particular warning seems most related to a bug in memory accounting.
 
 
 
It seems more likely to happen around containers with cgroup kmem accounting
(not yet stable in 3.x, and apparently there are still footnotes in 4.x),
but happens outside of those as well?
 
 
There was a kernel memory leak
 
 
-->
 
===Glossary===
<!--
 
 
Most modern memory management systems are virtual memory systems that have swapping,
combined with a processor that has a memory management unit (MMU) that does much of the low-level work of mapping between virtual and physical memory.
 
 
Storage hierarchy (though it's usually more like a fallback list):
* '''Main memory''' - memory wired almost directly to the CPU (contrasted with backing store), typically referring to [http://en.wikipedia.org/wiki/DRAM DRAM]
 
* '''Backing store'''
** often means 'storage on disk' (and that you are dealing with a swapping system that calls this store '''swap space''')
** ...though in general in a VMM system, it can also refer to RAM ''or'' disk that is committed to backing virtual memory
 
 
* '''Address space''' can refer to either:
** Virtual address space - the space of valid virtual addresses (usually distinct for each process)
** Physical address space - the space of valid physical addresses (sometimes: absolute addresses)
 
* '''Virtual address space''' may include
** uniquely mapped memory
** memory mapped IO
** shared memory (common for shared libraries, also used for IPC and such)
 
* '''Virtual memory''' - the meaning varies somewhat. Can refer to various related concepts:
** the concept of using swap to create more committable memory than physical memory (this not being real memory)
** a process's virtual address space
** See also [[#On virtual memory|below]].
 
 
* '''shared memory''' - the case where multiple processes intentionally map the same area of memory.
** Sometimes used to have fewer read-only copies of something in memory
** sometimes used for [[IPC|inter-process communication]].
 
 
* committed/mapped memory - virtual memory that is associated with physical memory. The distinction exists because (particularly) virtual memory systems may reserve virtual memory but not commit it until the range it comes from is accessed (because it may never be).
 
(Note: 'Mapped' is a little confusing because 'mapping' is also a fairly obvious word to choose for translation and lookup of addresses)
 
* '''commit limit''' - a property of the system as a whole: The amount of memory we are prepared to give to allocations. Typically ''swap + overcommit_factor * physical_ram''. Some systems do not allow overcommitting (overcommit_factor=1). If overcommit_factor &lt;1.0, we cannot back all promised memory with physical storage - but the point is that very usually, some portion of allocated memory will never be used, and for that portion it is pointless and wasteful to keep physical storage aside.
 
* '''overcommitting''' means that when you ask the kernel for a lot of memory:
** In a non-overcommitting system, the kernel answers with "No. I don't have all of that."
** In an overcommit system, the kernel answers with "Eh, we'll see how many of those pages you'll actually end up using."
** An alternative to overcommitting is adding a ''lot'' of swap space so the system can guarantee backing, with a lot of disk space that will probably never be used {{comment|(well, and RAM, but most VMMs are smart enough to consider unused pages as swapped out)}}.
** ...so depending on how applications work and allocate (...and count on overcommit logic), you may see serious amounts of overcommitment without seeing much RAM use ''or'' swapping.
** See also [[#Overcommit_or_not|the notes below]]
 
 
 
* '''paging, swapping''' - the act of moving data in and out of main memory. See also [[On swapping|below]]
* paged in / swapped in  - data that available in main memory
* paged out / swapped out - data not available in main memory (in most modern systems, this can only mean it is present in swap space)
 
* '''Thrashing''' - tends to refer to situations where there is more actively used memory than there is RAM, meaning that relatively actively used content is continuously swapped in and out, making a lot of memory accesses disk speed rather than RAM speed - making the overall computer response very slow. The cause for this often means it will not stop within reasonable time.
 
* Resident set - the part of a process's memory that is currently in physical RAM

If the total resident set is larger than main memory, you'll probably get thrashing.
-->
 
 
<!--
Also related:
* Page cache - recently accessed disk area kept temporarily in memory for faster repeated accessed (sometimes with basic prediction, readahead, and such)
* CPU cache - a very specific (and relatively small) cache between the CPU and main memory
 
See also [[Cache and proxy notes#Real-world_caches]]
-->
 
 
<!--
===Measures in linux===
 
* VmRSS: Resident set size (RES in top?)
* VmHWM: Peak resident set size ('high water mark')
 
* VmData, VmStk, VmExe: Size of data, stack, and text segments.
 
* VmSize: Virtual memory size (VIRT in top)
* VmPeak: Peak virtual memory size.
 
* VmLck: Locked memory size.
 
* VmLib: Shared library code size.
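
A sketch (assuming Linux) that prints these for the current process by scanning /proc/self/status:

 #include <stdio.h>
 #include <string.h>
 int main(void) {
     FILE *f = fopen("/proc/self/status", "r");
     if (!f) { perror("fopen"); return 1; }
     char line[256];
     while (fgets(line, sizeof line, f))
         if (strncmp(line, "Vm", 2) == 0)  /* VmRSS, VmSize, VmPeak, ... */
             fputs(line, stdout);
     fclose(f);
     return 0;
 }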
-->
