===Intro===

'Virtual memory' ended up doing a number of different things, which for the most part can be explained separately.

<!--
{{comment|(Note: this is a broad-strokes introduction that simplifies and ignores a lot of the historical evolution of how we got where we are and ''why'' - plus a bunch of which I know I don't know yet.)}}


'Virtual memory' describes an abstraction that we ended up using for a number of different things.

For the ''most'' part, you can explain those reasons separately,
though they got entangled over time (in ways that ''mostly'' operating system programmers need to worry about).


At a low level, memory access is "set an address, do a request, get back the result".

In olden times, this described hardware that did nothing more than that {{comment|(in some cases you even needed to do that yourself: set a value on the address pins, flip the pin that meant a request, and read out the data on some other pins)}},
and the point is that there was ''nothing'' keeping you from doing any request you wanted.

Because everything used the same memory space, memory management was a cooperative thing where everything needed to play nice.
That was hard, and beyond conventions about which parts belonged to the operating system and shouldn't be touched,
there were no standards for multiple processes running concurrently, unless they actively knew about each other.

Which was fine, because multitasking wasn't a buzzword yet.
We ran one thing at a time, and the exceptions to that were clever about how they did it.


To skip a ''lot'' of history {{comment|(the variants along the way are a mess to actually get into)}},
what we have now is a '''virtual memory system''', where
* each task gets its own address space,
* something manages the assignment of parts of memory to tasks,
* our running code ''never'' deals ''directly'' with physical addresses,
* and when a request is made, ''something'' does the translation between the addresses that the program sees and the physical addresses and memory that the request actually goes to (see the sketch just below).

{{comment|(The low-level implementation is also interesting, in that there is hardware assisting this setup - things would be terribly slow if there weren't. At the same time, these details are largely irrelevant to most of us, in that it's always there, and fully transparent even to programmers.)}}
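To make the translation idea a bit more concrete, here is a purely illustrative toy: a single-level page table lookup written out in C. All names here (PAGE_SIZE, NUM_PAGES, PageEntry, translate) are made up for this sketch; real systems use multi-level page tables, permission bits and a TLB, and the walk is done by the MMU in hardware rather than by code like this.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE 4096u     /* illustrative; a common page size */
#define NUM_PAGES 16u       /* toy address space of 16 virtual pages */

/* One page-table entry: is the page mapped, and to which physical frame? */
typedef struct {
    bool     present;
    uint32_t frame;         /* physical frame number */
} PageEntry;

static PageEntry page_table[NUM_PAGES];

/* Translate a virtual address to a physical one, or report a fault. */
static bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr / PAGE_SIZE;   /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE;   /* offset survives translation unchanged */

    if (vpn >= NUM_PAGES || !page_table[vpn].present)
        return false;                      /* this is where an MMU would raise a page fault */

    *paddr = page_table[vpn].frame * PAGE_SIZE + offset;
    return true;
}

int main(void) {
    /* Pretend the OS mapped virtual page 2 to physical frame 7. */
    page_table[2] = (PageEntry){ .present = true, .frame = 7 };

    uint32_t paddr;
    if (translate(2 * PAGE_SIZE + 123, &paddr))
        printf("virtual 0x%lx maps to physical 0x%lx\n",
               (unsigned long)(2 * PAGE_SIZE + 123), (unsigned long)paddr);
    if (!translate(5 * PAGE_SIZE, &paddr))
        printf("virtual 0x%lx is not mapped: page fault\n",
               (unsigned long)(5 * PAGE_SIZE));
    return 0;
}
</syntaxhighlight>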
 




There are a handful of reasons this addresses-per-task idea is useful.


One of them is just convenience.
If the OS tells you where to go, you avoid overwriting other tasks accidentally.
No matter the addresses used within each task, they can't clash in physical memory {{comment|(or rather, ''won't'' overlap until the OS specifically allows it - see [[shared memory]])}}.


Arguably the more important one is '''protected memory''':
that lookup can easily say "that was never allocated to you, ''denied''",
meaning a task can never accidentally ''or'' intentionally access memory it doesn't own.
{{comment|(There is no overlap in ownership until this is intentional: you specifically ask for it, and the OS specifically allows it - a.k.a. [[shared memory]].)}}

This is useful for stability,
in that a user task can't bring down a system task accidentally,
as was easy in the "everyone can trample over everyone" days.
Misbehaving tasks will ''probably'' fail in isolation.

It's also great for security,
in that tasks can't ''intentionally'' read what any other task is doing.

{{comment|(Note that you can have protection ''without'' virtual addresses, if you keep track of what belongs to a task ''without'' also adding a translation step.
You can also have virtual addresses without protection. A few embedded systems opt for one of these because it can be a little simpler (and a little faster) without that extra step of work or indirection. Yet on general-purpose computers you want, and get, both.)}}


Another reason is that processes (and most programmers) no longer have to think about other tasks, the OS, or their management.
Say, in the DOS days, you all used the same memory space, so memory management was a more cooperative thing -- which is a pain, and one of various reasons you would run one thing at a time (with few exceptions).


There are other details, like
* having this be a separate system means the OS can change how its memory system works without programs ever having to change in any way.
: this matters to changes in how hardware implements lookup/protection (with extensions and variations even within the same CPU architecture/family), but also lets us work very similarly on dissimilar hardware.

* it can make fragments of RAM look contiguous to a process, which makes life much easier for programmers, and has only a small effect on speed (because of the RA in RAM).
: generally the VMM does try to minimise fragmentation where possible, because too much can trash the fixed-size TLB.

* on many systems, the first page in a virtual address space is marked unreadable, which is how null pointer dereferences can be caught,
:: and why that happens to be easier/more efficient on systems ''with'' an MMU/MPU. (A minimal protection demo follows this list.)

* in practice it matters that the physical/virtual mapping is something a cache system can understand. There are other solutions, but they are messier.
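As a small illustration of the protection side (not something the article itself spells out): the sketch below asks the kernel for one page, then uses mprotect() to revoke access to it, after which any touch of that page becomes a fault that the kernel delivers as SIGSEGV. The signal handler is only there to make the fault visible.

<syntaxhighlight lang="c">
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* SIGSEGV handler: report the fault and exit. write() is async-signal-safe. */
static void on_segv(int sig) {
    (void)sig;
    const char msg[] = "caught SIGSEGV: that page is no longer accessible\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(0);
}

int main(void) {
    long pagesize = sysconf(_SC_PAGESIZE);

    /* Ask the kernel for one anonymous, readable and writable page. */
    char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello");                /* fine: we own this page */
    printf("page contains: %s\n", p);

    signal(SIGSEGV, on_segv);

    /* Revoke all access; the page tables now deny us our own page. */
    if (mprotect(p, pagesize, PROT_NONE) != 0) { perror("mprotect"); return 1; }

    printf("touching the page again...\n");
    fflush(stdout);
    return p[0];                       /* this access faults, so SIGSEGV is delivered */
}
</syntaxhighlight>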
-->


===Swapping / paging; thrashing===

<!--
'''Swapping''' is often understood as "things that could be in RAM are now on disk",
'''swapping in''' and '''swapping out''' as moving it to where it needs to go according to plan.

{{comment|(Side note: what windows usually calls paging, unixen usually call swapping. In broad descriptions you can treat them the same. Once you get into the details the workings and terms do vary, and precise use becomes more important.)}}


Yet there are a few other meanings, and the distinctions can be important.


There are a few reasons a system might swap (a way to watch this happening is sketched just below the list):

* the kernel decided some part of RAM was never accessed, and decides to reserve swap space rather than RAM for it
:: this amounts to just bookkeeping, rather than moving things around

* the kernel decided some part of RAM is rarely accessed, and decides to move it to disk

* the host is out of free RAM, and we're looking for the least-used RAM to move to disk

* a cgroup is out of RAM; similar idea

* swappiness says it's getting close
* overcommit_ratio
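One way to see whether a Linux machine is actually swapping, as opposed to merely having swap allocated, is to watch the pswpin / pswpout counters in /proc/vmstat (cumulative pages swapped in and out). A minimal reader, assuming a Linux-style /proc filesystem:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <string.h>

/* Print the cumulative swap-in / swap-out page counts from /proc/vmstat (Linux only). */
int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("/proc/vmstat"); return 1; }

    char name[64];
    unsigned long long value;
    while (fscanf(f, "%63s %llu", name, &value) == 2) {
        if (strcmp(name, "pswpin") == 0 || strcmp(name, "pswpout") == 0)
            printf("%s = %llu pages since boot\n", name, value);
    }
    fclose(f);
    return 0;
}
</syntaxhighlight>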
-->
====Overcommitting RAM with disk====
<!--
As mentioned, swapping/paging has the effect that
the VMM can have a pool of virtual memory that could be backed from RAM ''and'' from disk.

''"Can you choose to map or allocate more total memory than would all fit into RAM at the same time?"''
Yes. And a small degree of this is even ''common''. {{comment|(The sketch below shows one way to check the totals on a Linux system.)}}

This "...and also disk" can be considered overcommit of RAM,
though note this is ''not'' the only meaning of the term overcommit (or even the usual one), see below.

The VMM decides which parts go from RAM to disk, when, and how much of such disk-based memory there is.


Using disk for memory seems like a bad idea, because disks are significantly slower than RAM in both bandwidth and latency.
''Especially'' in the platter days, but it is still true in the SSD days.

Which is why the VMM will always prefer to use RAM when it has it.


Swap helps little beyond that. It helps when it means files you read get cached (see [[page cache]]),
but beyond that it has ''no'' effect.


Other notes:
* windows is actually more aggressive about swapping things out - it seems to favour IO caches
* linux is more tweakable (see [[#swappiness|swappiness]]) and by default is less aggressive.

* swap will not avoid oom_kill altogether - oom_kill is triggered on low speed of freeing pages {{verify}}
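Not from the original text, but a concrete way to look at the numbers involved: on Linux, sysinfo() reports total and free RAM and swap, which is enough to see how much committed memory could only ever be backed by disk.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <sys/sysinfo.h>   /* Linux-specific */

/* Print RAM and swap totals, in MiB, as reported by the kernel. */
int main(void) {
    struct sysinfo si;
    if (sysinfo(&si) != 0) { perror("sysinfo"); return 1; }

    double unit = si.mem_unit;           /* bytes per unit in the fields below */
    double mib  = 1024.0 * 1024.0;

    printf("RAM : %8.0f MiB total, %8.0f MiB free\n",
           si.totalram  * unit / mib, si.freeram  * unit / mib);
    printf("swap: %8.0f MiB total, %8.0f MiB free\n",
           si.totalswap * unit / mib, si.freeswap * unit / mib);
    return 0;
}
</syntaxhighlight>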
-->


====On memory scarcity====
 
<!--
On a RAM-only system, you will at some point find that you cannot find free pages.

When you've added swap and similar features,
you may find that your bookkeeping says it can be done,
but in practice it will happen very slowly.

Also, having disconnected programs from the backing store,
only the kernel can even guess at how bad that is.

The most obvious case is more pages being actively used than there is physical RAM (which can happen without overcommit, and more easily with it), but there are others. Apparently things like hot database backups may create so many [[dirty pages]] so quickly that the kernel decides it can't free anywhere near fast enough.

In a few cases it's due to a sudden (reasonable) influx of dirty pages, and is otherwise transient.

But in most cases scarcity is more permanent, meaning we've started swapping and probably [[trashing|thrashing]], making everything slow.

Such scarcity ''usually'' comes from a single careless / runaway program,
sometimes one that is just badly configured (e.g. you told more than one program that it could take 80% of RAM), and sometimes from a slew of (probably related) programs.
 
-->
 
 
 
 
<!--
=====SLUB: Unable to allocate memory on node=====
 
SLUB is [[slab allocation]], i.e. about dynamic allocation of kernel memory
 
 
This particular warning seems most related to a bug in memory accounting.
 
 
 
It seems more likely to happen around containers with cgroup kmem accounting,
(not yet stable in 3.x, and apparently there are still footnotes in 4.x)
but happens outside as well?
 
 
There was a kernel memory leak
 
 
-->
 
===="How large should my page/swap space be?"====


<!--
<!--
Consider that in a system with a VMM, applications only ever deal in virtual addresses; it is the VMM that implies translation to real backing storage.


There used to be advice like "Your swap file needs to be 1.5x RAM size", and tables to go along.
The tables's values varying wildly shows just how arbitrary this is.
That they are usually 20 years old more so.


And when doing memory access, one of the options is this access makes sense (the poage is known, and considered accessible) but cannot be accessed in the lightest sort of pass-through-to-RAM way.
It depends significantly with the amount of RAM you have.


A page fault, widely speaking, means "instead of direct access, the kernel needs to decide what to do now".


That signalling is called a page fault {{comment|(Microsoft also uses the term 'hard fault')}}.


Note that it's a ''signal'', caught by the OS kernel. It's called 'fault' only for historical low-level design reasons.
But also, it just depends on use.
 
Generally, the better answer is to consider:
* your active workload, what that may require of RAM in the worst case
: some things are fixed and low (lots of programs)
: some can scale up depending on use (think of editing a high res photo)
: some can make use anything they get (caches, database)
: some languages have their own allocators, which may pre-allocate a few GB but may never use it
: some programs are eager to allocate and less eager to clean up / garbage collect
 
* how much is inactive, so can be swapped out
: less than a GB in most cases, and a few GB in a few
 
* unused swap space doesn't really hurt (other than in disk space)
 
 
This argues that throwing a few GB at it is usually more than enough,
maybe a dozen or two GB when you have hungry/careless programs.
 
 
Servers are sometimes little different.
 
The numbers tend to be bigger - but ideally also more predictable.
 
Tweaking can make more sense because of it.
For example, when two major services try to each use 70% of RAM for caches,
they'll end up pushing each other to swap (and both incur disk latency),
and you're better off halving the size of each,  
when that implies means never involving disk.
 
And ideally you don't swap at all - having to swap something in means whatever's asking is now ''waiting''.
 


Additionally:
* on linux, hibernation reserves space in swap
: so if you use this, you need to add RAM-size to the above
: doesn't apply to windows, it puts hibernation in a separate preallocated file


This can mean one of multiple things. Most said ground is covered by the following cases:
* on windows, a crash dump (when set to dump all RAM) needs it to b [https://support.microsoft.com/en-us/help/2860880/how-to-determine-the-appropriate-page-file-size-for-64-bit-versions-of]
: so if you use this, you need to add RAM-size to the above




'''Minor page fault''', a.k.a. '''soft page fault'''
: Page is actually in RAM, but not currently marked in the MMU page table (often due to its limited size{{verify}})
: resolved by the kernel updating the MMU, ''then'' just allowing the access.
:: No memory needs to be moved around.
:: Very little extra latency
: (can happen around shared memory, or around memory that has been unmapped from processes but there had been no cause to delete it just yet - which is one way to implement a page cache)




'''Major page fault''', a.k.a. '''hard page fault'''
: memory is mapped, but not currently in RAM
: i.e. mapping on request, or loading on demand -- which how you can do overcommit and pretend there is more memory (which is quite sensible where demand rarely happens)
: resolved by the kernel finding some space, and loading that content. Free RAM, which can be made by swapping out another page.
:: Adds noticeable storage, namely that of your backing storage
:: the latter is sort of a "fill one hole by digging another" approach, yet this is only a real problem (trashing) when demand is higher than physical RAM


On basic PC use, at least 1GB is a good idea.


'''Invalid page fault'''
If you have a lot of RAM, you probably did that because you have memory hungry programs, and a few GB more is probably useful.
* memory isn't mapped, and there cannot be memory backing it
: resolved by the kernel raising a [[segmentation fault]] or [[bus error]] signal, which terminates the process


When RAM sizes like 2 or 4GB were typical (you know, ~2000 or so), this amounted to the same thing as 1.5x.


But these days, 1.5x means larger and larger amounts that will probably never bse used.  Which is not in the way, and often not ''that'' much much of a bite out of storage. It's not harmful, it's just y pointless.


DEDUPE WITH ABOVE




or not currently in main memory (often meaning swapped to disk),
"Too large a page file slows down your system"
or it does not currently have the backing memory mapped{{verify}}.


Depending on case, these are typically resolved either by
I don't see how it could.
* mapping the region and loading the content.
: which makes the specific memory access significantly slower than usual, but otherwise fine


* terminating the process
: when it failed to be able to actually fetch it


The only indirect way I can think of is paging behaviour that becomes more aggressive based on actual use but based on how much of each you have. But as far as I know that's not how it works.


Or perhaps if your page file fills your disk space to nearly full, and contributes to fragmentation of regular files.
And even that is less relevant now that SSDs are getting typical.


Reasons and responses include:
* '''minor page fault''' seems to includes:{{verify}}
** MMU was not aware that the page was accessible - kernel inform it is, then allows access
** writing to copy-on-write memory zone - kernel copies the page, then allows access
** writing to page that was promised by the allocated, but needed to be - kernel allocates, then allows access


* mapped file - kernel reads in the requested data, then allows access


* '''major page fault''' refers to:
"Page file of zero is faster because it keeps everything in memory"
** swapped out - kernel swaps it back in, then allows access


* '''invalid page fault''' is basically
True-ish, but not enough to matter in most cases.
** a [[segmentation fault]] - send SIGSEGV (default SIGSEGV hanler is to kill the process)


That is, most accesses, of almost all active programs,
will not be any faster or slower, because most active use come from RAM with or without swap enabled.


Note that most are not errors.


In the case of a [[memory mapped IO]], this is the designed behaviour.
The difference is that
* the rarely-accessed stuff stays in RAM and will not be slower.


* you get less usable RAM, because it now holds everything that is never accessed.
: ...this reduction is sometimes sometimes significant, depending on the specific programs you run, and how much they count on inactive-swap behaviour.


Minor will often happen regularly, because it includes mechanisms that are cheap, save memory, and thereby postpone major page faults.
* if swap means it's going to push things into it instead of deny allocation, then your system is more likely to recover eventually (see oom_kill), and not stop outright.
: this argues for maybe an extra RAM-size, because some programs are just that greedy


Major ideally happens as little as possibly, because memory access is delayed by disk IO.
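If you want to see how much of a given process is actually sitting in swap (useful when estimating the 'inactive' part above), Linux reports it per process as the VmSwap line in /proc/&lt;pid&gt;/status. A minimal self-inspection, Linux-only:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <string.h>

/* Print this process's resident and swapped-out sizes, from /proc/self/status (Linux). */
int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("/proc/self/status"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0 || strncmp(line, "VmSwap:", 7) == 0)
            fputs(line, stdout);       /* e.g. "VmSwap:        0 kB" */
    }
    fclose(f);
    return 0;
}
</syntaxhighlight>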


-->


See also
====Linux====
<!--
It seems that *nix swapping logic is smart enough to do basic RAID-like spreading among its swap devices, meaning that a swap partition on every disk that isn't actively used (e.g. by something important like a database) is probably useful.


Swap used for hibernation can only come from a swap partition, not a swap file {{comment|(largely because that depends too much on whether the underlying filesystem is mounted)}}.


Linux allows overcommit, but real-world cases vary.
It depends on three things:
* swap space
* RAM size
* overcommit_ratio (defaults to 50%)

When swap space is, say, half of RAM size, the default ratio works out to a commit limit of roughly RAM size (under strict accounting, overcommit_memory=2).

On servers/workstations with at least dozens of GBs of RAM,
this will easily mean overcommit_ratio should be 80-90 for userspace to be able to use most/all RAM.

If the commit limit is lower than RAM, the rest goes (mostly) to caches and buffers.
Which, note, is often useful, and sometimes it can even be preferable to effectively have a little dedicated cache.
(A way to inspect these numbers is sketched below.)
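To see where a machine currently stands, the commit accounting shows up in /proc/meminfo (CommitLimit, Committed_AS) and the knobs live under /proc/sys/vm/. A small, Linux-only sketch; the field names are as found on current kernels:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <string.h>

/* Print matching lines from a /proc file of "Name: value" lines, e.g. /proc/meminfo. */
static void show_field(const char *path, const char *field) {
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, field, strlen(field)) == 0)
            fputs(line, stdout);
    fclose(f);
}

/* Print the single value stored in a sysctl-style /proc file. */
static void show_file(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    char line[256];
    if (fgets(line, sizeof line, f))
        printf("%s: %s", path, line);
    fclose(f);
}

int main(void) {
    show_field("/proc/meminfo", "CommitLimit:");   /* how much the kernel will promise */
    show_field("/proc/meminfo", "Committed_AS:");  /* how much has been promised so far */
    show_file("/proc/sys/vm/overcommit_memory");   /* 0 = heuristic, 1 = always, 2 = strict */
    show_file("/proc/sys/vm/overcommit_ratio");    /* used in the strict mode's commit limit */
    return 0;
}
</syntaxhighlight>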


-->
=====Swappiness=====
{{stub}}

<!--
There is an aggressiveness with which an OS will swap out allocated-but-inactive pages to disk.

This is often controllable.

Linux calls this ''swappiness''.

Higher swappiness means the general tendency to swap out is higher.

This general swappiness is combined with other (often more volatile) information,
including the system's currently mapped ratio,
a measure of how much trouble the kernel has recently had freeing up memory,
and some per-process (per-page) statistics.


The cost is mainly the time spent,
the benefit is largely giving more RAM to programs and caches (then also doing some swapping now rather than later).

(note that linux swaps less aggressively than windows to start with - at least with default settings)


There are always pages that are inactive simply because programs very rarely use them ([[80/20]]-like access patterns).

But if you have plenty of free RAM it might not even swap ''those'' out, because the benefit is estimated to be low.
: I had 48GB and 256GB workstations at work and people rarely got them to swap ''anything''.


'''On caches'''

Swappiness applies mostly to a process's memory, and not to kernel constructs like the OS [[page cache]] (and [[dentry cache]], and [[inode cache]]).
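For reference, and not from the original text: on Linux the current value lives in /proc/sys/vm/swappiness, so checking it programmatically is a plain file read; changing it is a write to the same file (or sysctl vm.swappiness), which typically needs root.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Read the current vm.swappiness value (Linux; typically 0..100, default 60). */
int main(void) {
    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    if (!f) { perror("/proc/sys/vm/swappiness"); return 1; }

    int value;
    if (fscanf(f, "%d", &value) == 1)
        printf("vm.swappiness = %d\n", value);
    fclose(f);
    return 0;
}
</syntaxhighlight>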


-->


=====oom_kill=====

<tt>oom_kill</tt> is linux kernel code that starts killing processes when there is enough memory scarcity that memory allocations cannot happen within reasonable time (because that usually means we are already [[trashing|thrashing]]).




Killing processes sounds like a poor solution.

But consider that an OS can deal with completely running out of memory in roughly three ways:

* '''deny all memory allocations''' until the scarcity stops.
: This isn't very useful because
:: it will affect every program until the scarcity stops
:: if the cause is one flaky program - and it usually is just one - then the scarcity may not stop
:: programs that do not actually check every memory allocation will probably crash
:: programs that do such checks well may have no option but to stop completely (or maybe pause)
: So in the best case, random applications will stop doing useful things - probably crash - and in the worst case your system will crash.

* '''delay memory allocations''' until they can be satisfied.
: This isn't very useful because
:: this pauses all programs that need memory (they cannot be scheduled until we can give them the memory they ask for) until the scarcity stops
:: again, there is often no reason for this scarcity to stop
: so this typically means a large-scale system freeze (indistinguishable from a system crash in the practical sense of "it doesn't actually do anything").

* '''kill the misbehaving application''' to end the memory scarcity.
: This makes a bunch of assumptions that have to be true -- but it lets the system recover.
:: it assumes there is a single misbehaving process (not always true, e.g. two programs allocating most of RAM would be fine individually, and need an admin to configure them better)
:: ...usually the process with the most allocated memory, though oom_kill logic tries to be smarter than that
:: it assumes that the system has had enough memory for normal operation up to now, and that there is probably one haywire process (misbehaving or misconfigured, e.g. (pre-)allocating more memory than you have)
:: this could misfire on badly configured systems (e.g. multiple daemons all configured to use all RAM, or having no swap, leaving nothing to catch incidental variation)


Keep in mind that:

* oom_kill is sort of a worst-case fallback
: generally,
:: if you feel the need to rely on the OOM killer, don't.
:: if you feel the wish to overcommit, don't.
: oom_kill is meant to deal with pathological cases of misbehaviour
:: but even then it might pick some random daemon rather than the real offender, because in some cases the real offender is hard to define
:: note that you can isolate likely offenders via cgroups now (also meaning that swapping happens per cgroup)
:: and apparently oom_kill is now cgroups-aware

* oom_kill does not always save you.
: It seems that if your system is thrashing heavily already, it may not be able to act fast enough.
: (and it may possibly go overboard once things do catch up)

* You may wish to disable oom_kill when you are developing
: ...or at least treat an oom_kill in your logs as a fatal bug in the software that caused it.

* If you don't want oom_kill, you may still be able to get a reboot instead, by setting the following sysctls:
 vm.panic_on_oom=1
: and a nonzero kernel.panic (seconds to show the message before rebooting)
 kernel.panic=10

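Related, though not from the original text: per process, Linux exposes the badness score the OOM killer would use (/proc/&lt;pid&gt;/oom_score) and an adjustment knob (/proc/&lt;pid&gt;/oom_score_adj, ranging -1000 to +1000). A tiny sketch that reads its own score and then volunteers itself as a preferred victim:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Inspect and adjust this process's standing with the Linux OOM killer. */
int main(void) {
    FILE *f = fopen("/proc/self/oom_score", "r");
    if (f) {
        int score;
        if (fscanf(f, "%d", &score) == 1)
            printf("current oom_score: %d\n", score);
        fclose(f);
    }

    /* Positive adjustments make us a more likely victim; lowering it needs privilege. */
    f = fopen("/proc/self/oom_score_adj", "w");
    if (f) {
        fprintf(f, "500\n");
        fclose(f);
        printf("oom_score_adj set to 500 (more likely to be killed first)\n");
    }
    return 0;
}
</syntaxhighlight>
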
See also
* http://mirsi.home.cern.ch/mirsi/oom_kill/index.html

===Page faults===
<!--
{{zzz|For context:
Consider that in a system with a Virtual Memory Manager (VMM),
applications only ever deal in {{comment|(what in the grander scheme turns out to be)}} virtual addresses;
it is the VMM system's implementation that implies/does the translation to real backing storage.
So on any memory access, one of the tasks is making sure the access makes sense. For example, it may be that
* the page is not known
* the page is known, but not considered accessible
* the page is known, considered accessible, and in RAM
* the page is known, considered accessible, but cannot be accessed in the lightest sort of pass-through-to-RAM way.}}


A '''page fault''', widely speaking, means "instead of direct access, the kernel needs to decide what to do now" - that last case {{comment|(some of the others have their own names)}}.

That signalling is called a page fault {{comment|(Microsoft also uses the term 'hard fault')}}.

Note that it's a ''signal'' (caught by the OS kernel), and called a 'fault' only for historical, almost-electronic-level design reasons.


A page fault can still mean one of multiple things.

Most (but not even all) ground is covered by the following cases:


'''Minor page fault''', a.k.a. '''soft page fault'''
: the page is actually in RAM, but not currently marked in the MMU page table (often due to its limited size{{verify}})
: resolved by the kernel updating the MMU with the knowledge it needs, ''then'' probably allowing the access.
:: No memory needs to be moved around.
:: Little extra latency.
: (can happen around shared memory, or around memory that has been unmapped from processes but where there had been no cause to delete it just yet - which is one way to implement a page cache)
: minor faults seem to include:{{verify}}
:: the MMU was not aware that the page was accessible - the kernel informs it that it is, then allows the access
:: writing to a copy-on-write memory zone - the kernel copies the page, then allows the access
:: writing to a page that the allocator promised but had not yet actually backed - the kernel allocates it, then allows the access
:: a mapped file whose data is already in the [[page cache]] - the kernel fills in the mapping, then allows the access


'''Major page fault''', a.k.a. '''hard page fault'''
: the memory is mapped, but not currently in RAM - it was swapped out, or it is a mapped file that still needs to be read in
: i.e. mapping on request, or loading on demand -- which is how you can do overcommit and pretend there is more memory (which is quite sensible where demand rarely happens)
: resolved by the kernel finding some space and loading that content - free RAM, which can be made by swapping out another page.
:: Adds noticeable latency, namely that of your backing storage.
:: The latter is sort of a "fill one hole by digging another" approach, yet this is only a real problem (thrashing) when demand is higher than physical RAM.


'''Invalid page fault'''
: the memory isn't mapped, and there cannot be memory backing it
: resolved by the kernel raising a [[segmentation fault]] or [[bus error]] signal - the default SIGSEGV handler kills the process


Depending on the case, these are typically resolved either by
* mapping the region and loading the content
: which makes that specific memory access significantly slower than usual, but is otherwise fine
* terminating the process
: when the kernel could not actually back the access


Note that most page faults are not errors.

In the case of [[memory mapped IO]], this is the designed behaviour.


Minor faults will often happen regularly, because they include mechanisms that are cheap, save memory, and thereby postpone major page faults.

Major faults ideally happen as little as possible, because memory access is delayed by disk IO.
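One way to watch these counters from inside a process (an illustration, not part of the original text): getrusage() reports cumulative minor and major fault counts. Touching freshly mapped anonymous memory produces minor faults, since the kernel hands out zeroed pages on demand; major faults would need something swapped out or read from disk.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Print this process's cumulative minor/major page fault counts so far. */
static void report(const char *label) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%-24s minor=%ld major=%ld\n", label, ru.ru_minflt, ru.ru_majflt);
}

int main(void) {
    size_t len = 64 * 1024 * 1024;   /* 64 MiB of anonymous memory */

    report("before mmap:");

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    report("after mmap (untouched):");

    memset(p, 1, len);               /* first touch of each page faults it in */

    report("after touching it all:");

    munmap(p, len);
    return 0;
}
</syntaxhighlight>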


-->
See also
* http://en.wikipedia.org/wiki/Paging
* http://en.wikipedia.org/wiki/Page_fault
* http://en.wikipedia.org/wiki/Demand_paging


===Copy on write===
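As a minimal illustration of the general idea (a sketch, not taken from this article's text): after fork() on Linux, parent and child initially share the same physical pages, and the kernel copies a page only when one of them writes to it. The program below shows the resulting isolation; the lazy per-page copying itself is what shows up as minor page faults in the child.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t len = 16 * 1024 * 1024;            /* 16 MiB buffer */
    char *buf = malloc(len);
    if (!buf) return 1;
    memset(buf, 'A', len);

    pid_t pid = fork();                       /* child gets the same pages, marked copy-on-write */
    if (pid == 0) {
        /* Child: writing faults the touched pages, which the kernel then copies just for us. */
        memset(buf, 'B', len);
        printf("child : buf[0] = %c\n", buf[0]);   /* B */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    /* Parent: untouched by the child's writes - those pages were copied, not shared writably. */
    printf("parent: buf[0] = %c\n", buf[0]);       /* still A */
    free(buf);
    return 0;
}
</syntaxhighlight>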




===Glossary===
