On computer memory

=Memory limits on 32-bit and 64-bit machines=
{{stub}}
tl;dr:
* If you want to use significantly more than 4GB of RAM, you want a 64-bit OS.
* ...and since that is now typical, most of the details below are irrelevant
TODO: the distinction between (effects from) physical and virtual memory addressing should be made clearer.
<!--
Physical memory addressing
* not so v
* is complicated by the device hole
Virtual memory:
* means per-process page tables (virtual-physical mapping, managed by the OS, consulted by the processor)
* means even the kernel and its helpers have to be mapped this way
* {{comment|(for stability/security reasons we want to protect the kernel from accesses, so)}} there is a kernel/user split - that is, the OS reserves a virtual address range (often at 3GB or 2GB)
Things neither of those directly draws in but do affect (and often vary by OS):
* memory mapping
* shared libraries
-->
===Overall factoids===
'''OS-level and hardware-level details:'''
From the '''I want my processes to map as much as possible''' angle:
* the amount of memory ''a single process'' could hope to map is typically limited by its pointer size, so ~4GB on a 32-bit OS, and far more (2^64 bytes in theory) on a 64-bit OS.
:: Technically this could be entirely up to the OS, but in practice it is tied intimately to what the hardware natively does, because anything else would be ''slooow''.
* Most OS kernels reserve part of each process's address space for themselves (see the kernel/user split below), which means the area a program can actually ''allocate'' is smaller - perhaps 3GB, 2GB, sometimes even 1GB
: this is partly a pragmatic implementation detail from back when 32 ''mega''bytes was a ''lot'' of memory, and it has stuck around ever since
* Since the OS is in charge of virtual memory, it ''can'' map each process to physical memory separately, so in theory multiple 32-bit processes can ''together'' use more than 4GB {{comment|(a sketch of this follows this list)}}
: ...even on 32-bit OSes: you can for example compile the 32-bit linux kernel to use up to 64GB this way
:: a 32-bit OS can only do this through '''PAE''', which has to be supported and enabled in the motherboard, and supported and enabled in the OS.
:: Note: both 32-bit and 64-bit PAE-supporting motherboards ''may'' have somewhat strange limitations, e.g. the amount of memory they will actually allow/support {{comment|(mostly a problem in early PAE motherboards)}}
:: and PAE was problematic anyway - it's a nasty hack in nature, and e.g. drivers ''had'' to support it. It was eventually disabled in consumer windows (XP) for this reason. In the end it was mostly seen in servers, where the hardware and drivers were easier to vet.
* device memory maps take a chunk out of the 32-bit physical address space, which for 32-bit OSes would often mean that you couldn't use all of an installed 4GB
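To make the "collectively more than any single process" point concrete, here is a small sketch in C (not from the original article; the child count and per-child size are arbitrary): each forked child gets its own address space and its own allocation, so the total across children can exceed what any one process could map. Run it with care, since it really does touch the memory.
<pre>
/* many_procs.c - sketch: several processes can *collectively* use more memory
   than any single one of them could map, because each gets its own address space.
   Each child allocates and touches 512 MiB; with enough children the total
   exceeds 4 GiB even if every individual process is 32-bit. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILDREN  10
#define CHILD_MIB 512

int main(void) {
    for (int i = 0; i < CHILDREN; i++) {
        pid_t pid = fork();
        if (pid == 0) {                       /* child */
            size_t bytes = (size_t)CHILD_MIB * 1024 * 1024;
            char *p = malloc(bytes);
            if (p) memset(p, 1, bytes);       /* touch it so it is really resident */
            sleep(5);                         /* hold it so the children overlap in time */
            _exit(p ? 0 : 1);
        }
    }
    while (wait(NULL) > 0) {}                 /* parent: wait for all children */
    printf("each child held %d MiB, %d children => ~%d MiB total\n",
           CHILD_MIB, CHILDREN, CHILD_MIB * CHILDREN);
    return 0;
}
</pre>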
'''On 32-bit systems:'''
Process-level details:
* No ''single'' 32-bit process can ever map more than 4GB as addresses are 32-bit byte-addressing things.
* A process's address space has reserved parts, to map things like shared libraries, which means a single app can actually ''allocate'' less (often by at most a few hundred MBs) than what it can map{{verify}}. Usually no more than ~3GB can be allocated, sometimes less.
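As a minimal illustration of the pointer-size ceiling (a sketch, assuming a C compiler and, for the 32-bit case, 32-bit libraries installed so you can build with <tt>-m32</tt>):
<pre>
/* ptrsize.c - sketch: the address-space ceiling implied by pointer size.
   Compare a 32-bit build (e.g. `gcc -m32 ptrsize.c`) with a 64-bit build. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    printf("pointer size      : %zu bytes\n", sizeof(void *));
    printf("address space max : %llu bytes (%llu MiB)\n",
           (unsigned long long)SIZE_MAX,
           (unsigned long long)(SIZE_MAX >> 20));
    /* A 32-bit build prints 4 bytes and ~4095 MiB: a single process cannot
       even express an address beyond ~4GB, regardless of installed RAM. */
    return 0;
}
</pre>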
'''On 64-bit systems:'''
* none of the potentially annoying limitations that 32-bit systems have apply
: (assuming you are using a 64-bit OS, and not a 32-bit OS on a 64-bit system).
* The architecture lets you map 64-bit addresses
: ...in theory, anyway. The instruction set is set up for 64-bit everything, but current x86-64 CPU implementations decode only 48 address bits (for 256TiB), mainly because that can be widened later without breaking compatibility, and right now it saves copper and silicon that 99% of computers would never use {{comment|(see the sketch after this list for reading off the actual widths on linux)}}
: ...because in practice it's still more than you can currently physically put in most systems. {{comment|(there are a few supercomputers for which this matters, but arguably even there it's not so important because horizontal scaling is ''generally'' more useful than vertical scaling. But there are also a few architectures designed with a larger-than-64-bit addressing space)}}
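On linux you can read the implemented address widths from /proc/cpuinfo; the sketch below (not from the original article) just prints that line, which typically reports something like "48 bits virtual" even though pointers are 64-bit.
<pre>
/* addrbits.c - sketch (linux-specific): print the CPU's reported address widths.
   /proc/cpuinfo carries a line like
   "address sizes : 39 bits physical, 48 bits virtual". */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("fopen /proc/cpuinfo"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "address sizes", 13) == 0) {
            fputs(line, stdout);
            break;              /* same value for every core, one line is enough */
        }
    }
    fclose(f);
    return 0;
}
</pre>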
'''On both 32-bit (PAE) and 64-bit systems:'''
* Your motherboard may have assumptions/limitations that impose some lower limits than the theoretical one.
* Some OSes may artificially impose limits (particularly the more basic versions of Vista seem to do this{{verify}})
'''Windows-specific limitations:'''
* 32-bit Windows XP (since SP2) gives you '''no PAE memory benefits'''. You may still be using the PAE version of the kernel if you have DEP enabled (no-execute page protection) since that requires PAE to work{{verify}}, but PAE's memory upsides are '''disabled''' {{comment|(to avoid problems with certain buggy PAE-unaware drivers, possibly for other reasons)}}
* 64-bit Windows XP: ?
* the /3GB switch moves the user/kernel split, but for a single process to map more than 2GB, it must be compiled as large-address-aware
* Vista: different versions have memory limits that seem to be purely artificial (8GB, 16GB, 32GB, etc.) {{comment|(almost certainly market segmentation)}}
===Longer story / more background information===
A 32-bit machine implies memory addresses are 32-bit, as is the memory address bus to go along with them. The details are more complex, but the net effect is still that you can ask for 2^32 bytes of memory at byte resolution, which technically allows you to access up to 4GB.
The 'but' you hear coming is that 4GB of address space doesn't mean 4GB of memory use.
====The device hole (32-bit setup)====
One of the reasons the limit actually lies lower is devices. The top of the 4GB memory space (usually directly under the 4GB position) is used to map devices.
If you have close to 4GB of memory, this means part of your memory is not addressable by the CPU, and effectively missing.
The size of this hole depends on the actual devices, chipset, BIOS configuration, and more{{verify}}.
The BIOS settles the memory address map{{verify}}, and you can inspect the effective map {{comment|(Device Manager in windows, /proc/iomem in linux)}} in case you want to know whether it's hardware actively using the space {{comment|(The hungriest devices tend to be video cards - at the time having two 768MB nVidia 8800s in SLI was one of the worst cases)}} or whether your motherboard just doesn't support more than, say, 3GB at all.
Both these things can be the reason some people report seeing as little as 2.5GB out of the 4GB they plugged in.
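A small linux-specific sketch of the /proc/iomem inspection mentioned above: it prints only the "System RAM" ranges, so any hole below 4GB shows up as a gap between ranges. On recent kernels the addresses read back as zeros unless you run it as root.
<pre>
/* iomem.c - sketch (linux-specific): list the "System RAM" ranges from /proc/iomem. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/iomem", "r");
    if (!f) { perror("fopen /proc/iomem"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "System RAM"))   /* skip the device mappings in between */
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
</pre>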
This problem goes away once you run a 64-bit OS on a 64-bit processor -- though there were some earlier motherboards that still had old-style addressing leftovers and hence some issues.
Note that the subset of these issues caused purely by limited address space on 32-bit systems could also be alleviated, using PAE:
====PAE====
Practically all modern OSes use virtual memory.
While the prime upside is probably the isolation of memory, the fact that a memory map is kept for each process also means that on 32-bit, each application has its ''own'' 4GB memory map without interfering with anything else (virtual mapping practice allowing).
Which means that while each process could use 4GB at the very best, if the OS could see more memory, it might map distinct 4GBs to each process so that ''collectively'' you can use more than 4GB (or just your full 4GB even with device holes).
Physical Address Extension is a memory mapping extension (not a hack, as some people think) that does roughly that.
PAE needs specific OS support, but ''doesn't'' need to break the 32-bit model as applications see it.
It allowed mapping 32-bit virtual addresses into a 36-bit physical address space, which allows for 64GB {{comment|(though most motherboards had a lower limit)}}
PAE implies some extra work on each memory operation, but because there's hardware support it only kicked a few percent off memory access speed.
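For a feel of what PAE changes mechanically, here is an illustrative sketch (not from the original article) of how a 32-bit virtual address is sliced up under PAE paging: three levels (2 + 9 + 9 bits plus a 12-bit page offset) instead of the classic two-level 10 + 10 + 12, with 64-bit table entries providing the room for physical addresses beyond 32 bits.
<pre>
/* pae_split.c - sketch: how 32-bit PAE paging slices a virtual address.
   Classic 32-bit paging uses 10/10/12 bits (two levels, 32-bit entries);
   PAE uses 2/9/9/12 bits (three levels) with 64-bit entries, and those wider
   entries are what lets the *physical* address grow beyond 32 bits. */
#include <stdio.h>
#include <stdint.h>

static void pae_split(uint32_t va) {
    unsigned pdpt = (va >> 30) & 0x3;    /* 2 bits  -> 4 page-directory-pointer entries */
    unsigned pd   = (va >> 21) & 0x1FF;  /* 9 bits  -> 512 page-directory entries       */
    unsigned pt   = (va >> 12) & 0x1FF;  /* 9 bits  -> 512 page-table entries           */
    unsigned off  =  va        & 0xFFF;  /* 12 bits -> offset within the 4 KiB page     */
    printf("va 0x%08x -> PDPT %u, PD %u, PT %u, offset 0x%03x\n",
           va, pdpt, pd, pt, off);
}

int main(void) {
    pae_split(0xBFFFF123u);  /* an address near the typical 3 GiB user/kernel split */
    pae_split(0x00401000u);
    return 0;
}
</pre>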
All newish linux and windows versions support PAE, at least technically.
However:
* The CPU isn't the only thing that accesses memory. Many descriptions of this are vague, but it is easy to believe that any device driver that does DMA and is not PAE-aware may break things - such a driver does not know that the internally 64-bit addresses it is handed should be treated as (up to) 36-bit physical addresses.
* PAE was '''disabled''' in WinXP's SP2 to increase stability around such issues, while server windowses were less likely to have problems since they tend to run on more standard hardware and thereby better-vetted drivers.
====Kernel/user split====
{{stub}}
The kernel/user split, most relevant on 32-bit OSes, refers to an OS-enforced division of each process's mappable address space between the kernel and the process itself.
It looks like windows by default gives 2GB to each, while (modern) linuxes apparently default to 1GB kernel, 3GB application {{comment|(which is apparently rather tight on AGP and a few other things)}}.
(Note: '3GB for apps' means that any ''single'' process is limited to map 3GB. Multiple processes may sum up to whatever space you have free. The allocation probe sketched after the list below shows where a single process actually tops out.)
In practice you may want to shift the split, particularly in Windows since almost everything that would want >2GB memory runs in user space - mostly databases.
{{comment|The exception is Terminal Services (Remote Desktop), which seems to run in kernel space.}}
It seems that:
* linuxes tend to allow kernel/app splits of 1/3, 2/2 and 3/1 (GB),
* BSDs allow the split to be set to whatever you want{{verify}}.
* It seems{{verify}} windows can only shift its default 2/2 split to 1GB kernel, 3GB application, using the /3GB boot option {{comment|(the feature is somewhat confusingly called 4GT)}}. Windows applications are normally compiled with the 2/2 assumption and will not benefit unless built as large-address-aware; the exceptions are mostly database servers.
* You may be able to work around it with a 4G/4G split patch, combined with PAE - with some overhead.
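To see where a single process actually tops out, a crude probe like the following sketch (not from the article) can help: it keeps allocating and touching 16 MiB chunks until malloc fails. On a 32-bit build it typically stops well below 4GB, roughly at the user side of the split minus libraries and other reservations; the loop is capped so a 64-bit build does not eat the whole machine.
<pre>
/* probe_alloc.c - sketch: how much can a single process actually allocate? */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_MIB 16
#define CAP_MIB   (8UL * 1024)   /* stop at 8 GiB so a 64-bit build stays polite */

int main(void) {
    unsigned long mib = 0;
    while (mib < CAP_MIB) {
        void *p = malloc((size_t)CHUNK_MIB * 1024 * 1024);
        if (!p) break;                                  /* address space (or memory) exhausted */
        memset(p, 1, (size_t)CHUNK_MIB * 1024 * 1024);  /* touch it so it is really backed */
        mib += CHUNK_MIB;                               /* never freed on purpose: probing the ceiling */
    }
    printf("allocated about %lu MiB before giving up\n", mib);
    return 0;
}
</pre>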
===See also===
* http://www.dansdata.com/askdan00015.htm
* http://linux-mm.org/HighMemory
* [http://www-128.ibm.com/developerworks/linux/library/l-memmod/ Explore the Linux memory model]
* http://www.spack.org/wiki/LinuxRamLimits
* http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory
* http://kerneltrap.org/node/2450
* http://en.wikipedia.org/wiki/3_GB_barrier
<!--
==Motherboards==
Integrated graphics used to mean a chip on the motherboard (in or near the northbridge{{verify}}).
This often meant the cheapest option a motherboard manufacturer could find, which is nice and minimal if you have no needs beyond office stuff and a bit of web browsing, but not enough for any shiny graphics worth staring at.
It also ate some of your CPU time and main memory.
More recent integrated graphics is actually inside the CPU, and seem to be more like entry-level graphics cards and can play more games than the motherboard integrated graphics could.
Also, the fact that there are very few options means the choice is much clearer.
Gamers will always want a more serious video card - typically even something costing a few dozen bucks will be nicer.
===PCI, PCI-Express===
PCI Express (PCIe) is an already-common standard designed to replace the older PCI, PCI-X, and AGP standards {{comment|(PCI-X is PCI-eXtended, which was a variant on PCI, largely seen on servers, before there was PCIe. AGP was mostly used for video cards)}}
PCI was enough for most low-bandwidth cards, but started creaking for some applications a while ago (video capture, gbit ethernet, and such).
PCIe means more bandwidth, and is less of a single shared bus and more of a point to point thing (and theoretically more of a switched network thing{{verify}}), and is also symmetric and full duplex (can do the speed in both directions, and at the same time).
: '''On PCIe speeds and slots'''
The slot basically consists of a small chunk of power (and simple management bus stuff), a bit of plastic, and the rest being the data lanes. You can eyeball what sort of slot you have by the size of the lane part.
The common slots:
* x1 (250MB/sec on PCIe 1.x) already much faster than PCI, and fast enough for many things.
* x4 (500MB/sec on PCIe 1.x) used by some higher-speed devices (e.g. multi-port GBit controllers), some RAID controllers, and such
* x16 (4GB/sec on PCIe 1.x) is used by video cards, some RAID controllers, and such
* (x2, x8 and x32 exist, but are not seen very often)
You can always plug PCIe cards into larger slots.
* A designation like x16 can refer both to the physical slot (its size is largely dictated by its lanes{{verify}}) and to the speed it can actually do.
** Which isn't always the same. There are e.g. motherboards with x16 slots that only do x8 speeds. ...for example because x16 was faster than most CPU and memory bus speeds at the time of introduction, which would make true x16 a waste of your money.
PCIe specs actually mention gigatransfers/sec. Each transfer carries one bit per lane, and assuming [http://en.wikipedia.org/wiki/8b/10b_encoding 8b/10b] coding (used by PCIe 1.x and 2.x), each GT/s per lane works out to 100 MByte/s per lane - so divide the GT/s figure by 10 to get GByte/s.
The speeds mentioned above are for PCIe 1, which can do 2.5 GT/s per lane.
For comparison:
* v1.x: 250 MByte/s/lane (2.5 GT/s/lane)
* v2.x: 500 MByte/s/lane (5 GT/s/lane)
* v3.0: 1 GByte/s/lane (8 GT/s/lane)
* v4.0: 2 GByte/s/lane (16 GT/s/lane)
Note that both device and motherboard need to support the higher PCIe variant to actually use these speeds.
-->


=On memory fragmentation=

==Fragmentation in general==

===Slab allocation===
{{stub}}


The slab allocator provides caches of fixed-size objects.

Slab allocation is often used in kernel modules/drivers that are perfectly happy allocating only uniform-sized, potentially short-lived structures - think task structures, filesystem internals, network buffers.

Fixed size, and often a separate cache per specific type, makes it easier to write an allocator that guarantees allocation within a very small timeframe (it avoids "hey, let me look at RAM and all the allocations currently in there" - you can keep track of which slots are taken with a simple bitmap, and the cache cannot fragment internally).

There may also be caches not for specific data structures but for generic fixed sizes like 4K, 8K, 32K, 64K, 128K, etc., used for things that have known bounds but not precise sizes - similar low-overhead allocation at the cost of some wasted RAM.
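
To make the bitmap bookkeeping concrete, here is a toy fixed-size cache in C (a sketch in the spirit of slab allocation, not the kernel's actual SLAB/SLUB code): one page-sized slab, equal slots, and a single word of bitmap marking which slots are taken.
<pre>
/* mini_slab.c - a toy fixed-size object cache in the spirit of a slab allocator.
   Real slab allocators manage many slabs per cache, per-CPU lists, constructors,
   coloring, etc. - this only shows why the bookkeeping is cheap and why such a
   cache cannot fragment internally. */
#include <stdint.h>
#include <stdio.h>

#define SLAB_SIZE 4096                       /* one "page" */

struct slab {
    size_t   obj_size;                       /* fixed size of every object in this cache  */
    size_t   nr_objs;                        /* how many objects fit in the slab          */
    uint64_t bitmap;                         /* bit i set => slot i in use (<=64 slots)   */
    unsigned char mem[SLAB_SIZE];
};

static void slab_init(struct slab *s, size_t obj_size) {
    s->obj_size = obj_size;
    s->nr_objs  = SLAB_SIZE / obj_size;
    if (s->nr_objs > 64) s->nr_objs = 64;    /* keep the bitmap a single word for the toy */
    s->bitmap   = 0;
}

static void *slab_alloc(struct slab *s) {
    for (size_t i = 0; i < s->nr_objs; i++) {        /* find a free slot */
        if (!(s->bitmap & (UINT64_C(1) << i))) {
            s->bitmap |= UINT64_C(1) << i;
            return s->mem + i * s->obj_size;
        }
    }
    return NULL;                             /* slab full: a real allocator grabs another page */
}

static void slab_free(struct slab *s, void *p) {
    size_t i = ((unsigned char *)p - s->mem) / s->obj_size;
    s->bitmap &= ~(UINT64_C(1) << i);        /* O(1), and the hole is reusable as-is */
}

int main(void) {
    struct slab cache;
    slab_init(&cache, 128);                  /* e.g. a cache of 128-byte structures */
    void *a = slab_alloc(&cache), *b = slab_alloc(&cache);
    slab_free(&cache, a);
    void *c = slab_alloc(&cache);            /* reuses a's slot: same-size holes never fragment */
    printf("a=%p b=%p c=%p (c == a: %d)\n", a, b, c, c == a);
    return 0;
}
</pre>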


Upsides:
* each such cache is easy to handle
* avoids the fragmentation that the otherwise-typical buddy system still has, because all holes are of the same size
* makes slab allocation/free simpler, and thereby a little faster
* easier to fit them to hardware caches

Limits:
* it still deals with the page allocator under the covers, so deallocation patterns can still mean that pages for the same cache become sparsely filled - which wastes space


SLAB, SLOB, SLUB:
* SLOB: K&R-style allocator (1991-1999), aims to allocate as compactly as possible, but fragments faster than the others.
* SLAB: Solaris-type allocator (1999-2008), as cache-friendly as possible.
* SLUB: Unqueued allocator (2008-today): execution-time friendly, not always as cache-friendly, does defragmentation (mostly just of pages with few objects).


For some indication of what's happening, look at slabtop and slabinfo.
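
If you prefer to poke at it programmatically, the following sketch (linux-specific; /proc/slabinfo is usually readable by root only, and its columns vary a little by kernel version) just dumps the first few lines of /proc/slabinfo; slabtop presents the same data more comfortably.
<pre>
/* slabpeek.c - sketch (linux-specific): print the first few lines of /proc/slabinfo. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/slabinfo", "r");
    if (!f) { perror("fopen /proc/slabinfo (try running as root)"); return 1; }
    char line[512];
    int shown = 0;
    while (fgets(line, sizeof line, f) && shown < 12) {
        fputs(line, stdout);                 /* header lines plus the first few caches */
        shown++;
    }
    fclose(f);
    return 0;
}
</pre>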

See also:


There are some similar higher-level "I will handle things of the same type" allocators, from custom allocators in C, to object allocators in certain languages, arguably even just the implementation of certain data structures.