=On memory fragmentation=
 
==Fragmentation in general==
<!--
 
Since you don't want to manage all memory by yourself, you usually want something to provide an allocator for you, so you can ask "I would like X amount of contiguous memory now, please" and it'll figure out where it needs to come from.
 
Since an allocator will serve many such requests over time, and programs ask for fairly arbitrary-sized chunks,
over time these allocations end up somewhat arbitrarily positioned, with arbitrarily sized holes in-between.
 
These holes will serve smaller but not larger requests, and mean programs get chunks that are fairly randomly positioned in physical memory.
 
 
 
Fragmentation in general can be bad for a few reasons:
* allocation becoming slightly slower, because the bookkeeping gets more complex over time

* slightly lower access speed due to going to more places - e.g. when you think things are in sequence but they actually come from different places in RAM
: since most memory is random-access (but there are details like hardware caches), the overhead is small and the difference is small

* the holes that develop over time mean that fragmentation itself tends to speed up over time
 
 
 
 
This is basically irrelevant for physical memory, mostly because the translation between physical and virtual memory is done for us.

Physical memory can fragment all it wants; the VMM can make it look entirely linear in terms of the addressing we see, so there is almost no side effect on the RAM side.
 
It's still kept simple because the mechanism that ''does'' those RAM accesses involves a lookup table,
and it's better if most of that lookup table can stay in hardware (if interested, look for how the TLB works).
 
But all this matters more to people writing operating systems.
 
 
 
So instead, memory fragmentation typically refers to virtual address fragmentation.
 
Which refers to an app using ''its'' address space badly.
 
For example, its heap may, over time and with many alloc()s and free()s, grow holes that fewer allocations can be served by, meaning some percentage of its committed memory won't often be used.
 
 
Note that due to swapping, even this has limited effect - pages that are committed but rarely touched can be swapped out, so they need not keep occupying physical RAM.
 
-->
 
==Slab allocation==
{{stub}}
 
 
The slab allocator manages caches of fixed-size objects.

Slab allocation is often used in kernel modules/drivers that only need to allocate uniform-sized, potentially short-lived structures - think task structures, filesystem internals, network buffers.

Fixed sizes, often with a separate cache for each specific type, make it easier to write an allocator that guarantees allocation within a very small timeframe (it avoids "hey, let me look at RAM and all the allocations currently in there" - you can keep track of slots being taken or not with a simple bitmask, and the cache ''cannot'' fragment).
 
There are also caches not tied to a specific data structure but to fixed sizes like 4K, 8K, 32K, 64K, 128K, etc., used for things that have known bounds but not precise sizes - giving similarly low time overhead at the cost of some wasted RAM.
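To make the bitmask idea concrete, here is a minimal C sketch of one fixed-size slab. This is not the kernel's actual SLAB/SLUB code, just the principle: fixed-size slots plus a bitmask make allocation simple, bounded in time, and fragmentation-free within the slab.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdlib.h>

/* Minimal sketch of slab-style allocation: one "slab" holds SLOTS objects of
   one fixed size, and a bitmask tracks which slots are in use. Allocation and
   freeing are bounded (at worst a scan of SLOTS bits) and cannot fragment. */

#define SLOTS 64

struct slab {
    uint64_t used;          /* bit i set = slot i taken */
    size_t   obj_size;      /* fixed object size for this slab */
    unsigned char *mem;     /* SLOTS * obj_size bytes of backing memory */
};

static void slab_init(struct slab *s, size_t obj_size)
{
    s->used = 0;
    s->obj_size = obj_size;
    s->mem = malloc(SLOTS * obj_size);   /* a kernel would grab pages here */
}

static void *slab_alloc(struct slab *s)
{
    for (int i = 0; i < SLOTS; i++) {
        if (!(s->used & (1ULL << i))) {
            s->used |= 1ULL << i;
            return s->mem + (size_t)i * s->obj_size;
        }
    }
    return NULL;   /* slab full: a real allocator would chain another slab */
}

static void slab_free(struct slab *s, void *p)
{
    size_t i = ((unsigned char *)p - s->mem) / s->obj_size;
    s->used &= ~(1ULL << i);
}
</syntaxhighlight>

A real implementation chains many such slabs per cache and gets its backing memory from the page allocator, which is where the "sparsely filled pages" limitation below comes from.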
 
 
Upsides:
: each such cache is easy to handle
: avoids fragmentation, because all holes are of the same size
:: (a problem that the otherwise-typical [https://en.wikipedia.org/wiki/Buddy_memory_allocation buddy system] still has)
: makes slab allocation/free simpler, and thereby a little faster
: easier to fit objects to hardware caches
 
Limits:
: It still deals with the page allocator under the covers, so deallocation patterns can still mean that pages for the same cache become sparsely filled - which wastes space.
 
 
SLAB, SLOB, SLUB:
* SLOB: K&R allocator (1991-1999), aims to allocate as compactly as possible. But fragments faster than various others.
* SLAB: Solaris type allocator (1999-2008), as cache-friendly as possible.
* SLUB: Unqueued allocator (2008-today): Execution-time friendly, not always as cache friendly, does defragmentation (mostly just of pages with few objects)
 
 
For some indication of what's happening, look at {{inlinecode|slabtop}} and {{inlinecode|slabinfo}}
 
See also:
* http://www.secretmango.com/jimb/Whitepapers/slabs/slab.html
* https://linux-mm.org/PageAllocation
 
 
 
There are similar higher-level "I will handle things of the same type" allocators,
from custom allocators in C,
to object allocators in certain languages,
to, arguably, even just the implementation of certain data structures.
 
=Memory mapped IO and files=
{{stub}}
 
Note that
: memory mapped IO is a hardware-level construction, while
: memory mapped files are a software construction (...because files are).
 
 
 
===Memory mapped files===
 
Memory mapping of files is a technique (OS feature, system call) that pretends a file is accessible at some address in memory.
 
When the process accesses those memory locations, the OS will scramble for the actual contents from disk.
 
Whether this will then be cached depends a little on the OS and details{{verify}}.
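As a minimal POSIX sketch (the file name is just an example), mapping a file read-only and reading a byte through a pointer looks like this:

<syntaxhighlight lang="c">
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Map a file read-only and access it as if it were memory.
   The OS pages parts in from disk on first access. */
int main(void)
{
    int fd = open("somefile.bin", O_RDONLY);     /* example file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: 0x%02x\n", (unsigned char)data[0]);  /* triggers a page-in */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
</syntaxhighlight>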
 
 
====For caching====
 
In e.g. linux you get that interaction with the page cache, and the data is and stays cached as long as there is RAM for it.
 
 
This can also save memory, compared to the easy choice (''without'' memory mapping) of manually caching the entire thing in your process.

With mmap you may cache only the parts you use, and if multiple processes want this file, you may avoid a little duplication.
 
 
The fact that the OS can flush most or all of this data can be seen as a limitation or a feature - it's not always predictable, but it does mean you can deal with large data sets without having to think about very large allocations, and how those aren't nice to other apps.
 
 
 
====shared memory via memory mapped files====
 
Most kernel implementations allow multiple processes to mmap the same file -- which effectively shares memory, and is probably one of the simplest ways to do so in a [http://en.wikipedia.org/wiki/Protected_mode protected mode] system.
{{comment|(Some methods of [[Inter-Process communication]] work via mmapping)}}
 
 
Not clobbering each other's memory is still something you need to do yourself.
 
The implementation, limitations, and method of use varies per OS / kernel.
 
Often relies on [http://en.wikipedia.org/wiki/Demand_paging demand paging] to work.
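A minimal sketch of that idea, assuming POSIX mmap and a parent/child pair of processes (the file name is made up; related processes could also use an anonymous shared mapping instead of a file):

<syntaxhighlight lang="c">
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Parent and child map the same file with MAP_SHARED, so writes by one
   are visible to the other. Not clobbering each other is still up to you. */
int main(void)
{
    int fd = open("shared.bin", O_RDWR | O_CREAT, 0600);  /* example file name */
    ftruncate(fd, 4096);                                  /* make it one page  */

    int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (fork() == 0) {            /* child: write a value */
        shared[0] = 42;
        return 0;
    }
    wait(NULL);                   /* parent: wait, then read what the child wrote */
    printf("child wrote: %d\n", shared[0]);

    munmap(shared, 4096);
    close(fd);
    return 0;
}
</syntaxhighlight>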
 
===Memory mapped IO===
 
Map devices into memory space (statically or dynamically), meaning that memory accesses to those areas are actually backed by IO accesses (...that you can typically also do directly).
 
This mapping is made and resolved at hardware level, and only works for [[DMA]]-capable devices (which many are).
 
It seems to often be done to have a simple generic interface {{verify}} - it means drivers and software can avoid many hardware-specific details.
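From the software side, memory mapped IO ends up looking like plain memory accesses through {{inlinecode|volatile}} pointers. The register layout below is entirely hypothetical - real addresses and bits come from the device's datasheet, or from whatever the OS/firmware mapped for you:

<syntaxhighlight lang="c">
#include <stdint.h>

/* Memory-mapped IO from a driver's point of view: a device register is just
   an address. The addresses and bits below are made up for illustration.
   'volatile' keeps the compiler from caching or reordering the accesses. */

#define UART_BASE   0x40001000u                 /* hypothetical device base address */
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x00))
#define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x04))
#define TX_READY    (1u << 0)                   /* hypothetical status bit */

static void uart_putc(char c)
{
    while (!(UART_STATUS & TX_READY))           /* poll until the device is ready */
        ;
    UART_DATA = (uint32_t)c;                    /* this 'store' is really an IO access */
}
</syntaxhighlight>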
 
 
See also:
* http://en.wikipedia.org/wiki/Memory-mapped_I/O
 
 
 
[[Category:Programming]]
 
===DMA===
{{stub}}
 
Direct Memory Access comes down to additional hardware that can be programmed to copy bytes from one memory address to another,
meaning the CPU doesn't have to do this.
 
DMA is independent enough at hardware level that its transfers can work at high clocks and throughputs (and without interrupting other work), comparable to CPU copies (the CPU may be faster if it was otherwise idle; when it is not, the extra context switching may slow things down and DMA may be relatively free. Details vary with specific designs, though).
 
 
Transfers tend to work in smaller chunks, triggered by DRQ {{comment|(similar in concept to IRQs, but triggering only a smallish copy, rather than arbitrary code)}}, so that the copying can be coordinated piece by piece.
 
The details look intimidating at first, but mostly because they are low-level.
The idea is actually ''relatively'' simple.
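As an illustration of that idea, a hypothetical memory-to-memory DMA controller might be programmed like this. All register names, addresses, and bits are made up; real controllers add channel, trigger, and transfer-width configuration, described in the chip's datasheet:

<syntaxhighlight lang="c">
#include <stdint.h>

/* The general shape of programming a DMA controller: tell it a source, a
   destination, and a length, then start it and let it copy without the CPU.
   Everything here is hypothetical, and assumes a 32-bit address space
   (as on a typical microcontroller). */

#define DMA_BASE  0x40002000u
#define DMA_SRC   (*(volatile uint32_t *)(DMA_BASE + 0x00))  /* source address */
#define DMA_DST   (*(volatile uint32_t *)(DMA_BASE + 0x04))  /* destination    */
#define DMA_LEN   (*(volatile uint32_t *)(DMA_BASE + 0x08))  /* bytes to copy  */
#define DMA_CTRL  (*(volatile uint32_t *)(DMA_BASE + 0x0C))  /* control/status */
#define DMA_START (1u << 0)
#define DMA_DONE  (1u << 1)

static void dma_copy(const void *src, void *dst, uint32_t len)
{
    DMA_SRC  = (uint32_t)(uintptr_t)src;
    DMA_DST  = (uint32_t)(uintptr_t)dst;
    DMA_LEN  = len;
    DMA_CTRL = DMA_START;              /* hardware copies in the background */

    while (!(DMA_CTRL & DMA_DONE))     /* ...or, better, take an interrupt  */
        ;                              /* instead of polling like this      */
}
</syntaxhighlight>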
 
 
 
Aside from memory-to-memory use, it also allows memory-to-peripheral copies (if a specific supporting device is [[memory mapped]]{{verify}}).
 
 
 
<!--
Presumably DMA hardware grew more generic over time, to be a more controllable subsystem
 
-->
 
 
<!--
 
Consider e.g. I2S output, which
: you just couldn't (easily) get to be regular without it being independent.
:
 
 
There's an interesting discussion of how this works in an STM32 microcontroller at http://cliffle.com/blog/pushing-pixels/
 
 
http://en.wikipedia.org/wiki/Direct_memory_access
-->
 
=Memory limits on 32-bit and 64-bit machines=
{{stub}}
 
 
tl;dr:
* If you want to use significantly more than 4GB of RAM, you want a 64-bit OS.
* ...and since that is now typical, most of the details below are irrelevant
 
 
 
TODO: the distinction between (effects from) physical and virtual memory addressing should be made clearer.
 
<!--
Physical memory addressing
* not so v
* is complicated by the device hole
 
Virtual memory:
* means per-process page tables (virtual-physical mapping, managed by the OS, consulted by the processor)
* means even the kernel and its helpers has to be mapped this way
* {{comment|(for stability/security reasons we want to protect the kernel from accesses, so)}} there is a kernel/user split - that is, the OS reserves a virtual address range (often at 3GB or 2GB)
 
Things neither of those directly draws in but do affect (and often vary by OS):
* memory mapping
* shared libraries
 
-->
 
===Overall factoids===
 
'''OS-level and hardware-level details:'''
 
From the '''I want my processes to map as much as possible''' angle:
* the amount of memory ''a single process'' could hope to map is typically limited by its pointer size, so ~4GB on 32-bit OS, 64-bit (lots) on a 64-bit OS.
:: Technically this could be entirely about the OS, but in reality it is tied intimately to what the hardware natively does, because anything else would be ''slooow''.

* Most OS kernels have a split (for their own ease) that means that of the area a program can map, less is allocatable - perhaps 3GB, 2GB, sometimes even 1GB
: this is partly a pragmatic implementation detail from back when 32 ''mega''bytes was a ''lot'' of memory, and it has been left over ever since
 
 
* Since the OS is in charge of virtual memory, it ''can'' map each process to memory separately, so in theory you can host multiple 32-bit processes to ''together'' use more than 4GB
: ...even on 32-bit OSes: you can for example compile the 32-bit linux kernel to use up to 64GB this way
:: a 32-bit OS can only do this through '''PAE''', which has to be supported and enabled in the motherboard, and supported and enabled in the OS.
:: Note: both 32-bit and 64-bit PAE-supporting motherboards ''may'' have somewhat strange limitations, e.g. in the amount of memory they will actually allow/support {{comment|(mostly a problem in early PAE motherboards)}}
:: and PAE was problematic anyway - it's a nasty hack in nature, and e.g. drivers ''had'' to support it. It was eventually disabled in consumer windows (XP) for this reason. In the end it was mostly seen in servers, where the details were easier to oversee.
 
 
* device memory maps would take mappable memory away from within each process, which for 32-bit OSes would often mean that you couldn't use all of that installed 4GB
 
 
 
 
'''On 32-bit systems:'''
 
Process-level details:
* No ''single'' 32-bit process can ever map more than 4GB as addresses are 32-bit byte-addressing things.
 
* A process's address space has reserved parts, to map things like shared libraries, which means a single app can actually ''allocate'' less (often by at most a few hundred MBs) than what it can map{{verify}}. Usually no more than ~3GB can be allocated, sometimes less.
 
 
 
'''On 64-bit systems:'''
* none of the potentially annoying limitations that 32-bit systems have apply
: (assuming you are using a 64-bit OS, and not a 32-bit OS on a 64-bit system).
 
* The architecture lets you map 64-bit addresses
: ...in theory, anyway. The instruction set is set up for 64-bit everything, but current x86-64 CPU implementations use 48-bit addressing (for 256TiB), mainly because it can be increased later without breaking compatibility, and right now it saves copper and silicon that 99% of computers wouldn't use
: ...because in practice it's still more than you can currently physically put in most systems. {{comment|(there are a few supercomputers for which this matters, but arguably even there it's not so important because horizontal scaling is ''generally'' more useful than vertical scaling. But there are also a few architectures designed with a larger-than-64-bit addressing space)}}
 
 
On both 32-bit (PAE) and 64-bit systems:
* Your motherboard may have assumptions/limitations that impose some lower limits than the theoretical one.
 
* Some OSes may artificially impose limits (particularly the more basic versions of Vista seem to do this{{verify}})
 
 
 
Windows-specific limitations:
* 32-bit Windows XP (since SP2) gives you '''no PAE memory benefits'''. You may still be using the PAE version of the kernel if you have DEP enabled (no-execute page protection) since that requires PAE to work{{verify}}, but PAE's memory upsides are '''disabled''' {{comment|(to avoid problems with certain buggy PAE-unaware drivers, possibly for other reasons)}}
 
* 64-bit Windows XP: ?
 
* the /3GB switch moves the user/kernel split, but for a single process to map more than 2GB it must be built 3GB-aware

* Vista: different versions have memory limits that seem to be purely artificial (8GB, 16GB, 32GB, etc.) {{comment|(almost certainly out of market segmentation)}}
 
===Longer story / more background information===
A 32-bit machine implies memory addresses are 32-bit, as is the memory address bus that goes along with it. It's more complex than that, but the net effect is still that you can ask for 2^32 bytes of memory at byte resolution, so it technically allows you to address up to 4GB.
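A quick illustration of that arithmetic (and of checking pointer width on whatever you compile this on):

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdint.h>

/* Pointer width decides how much a single process can even address:
   2^32 bytes = 4 GiB; 2^48 bytes = 256 TiB (current x86-64 implementations). */
int main(void)
{
    printf("pointer size: %zu bits\n", sizeof(void *) * 8);
    printf("32-bit address space: %llu bytes (= 4 GiB)\n",
           (unsigned long long)1 << 32);
    printf("48-bit address space: %llu bytes (= 256 TiB)\n",
           (unsigned long long)1 << 48);
    return 0;
}
</syntaxhighlight>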
 
 
The 'but' you hear coming is that 4GB of address space doesn't mean 4GB of memory use.
 
 
 
====The device hole (32-bit setup)====
One of the reasons the limit actually lies lower is devices. The top of the 4GB memory space (usually directly under the 4GB position) is used to map devices.
 
If you have close to 4GB of memory, this means part of your memory is still not addressable by the CPU, and effectively missing.
The size of this hole depends on the actual devices, chipset, BIOS configuration, and more{{verify}}.
 
 
The BIOS settles the memory address map{{verify}}, and you can inspect the effective map {{comment|(Device Manager in windows, /proc/iomem in linux)}} in case you want to know whether it's hardware actively using the space {{comment|(The hungriest devices tend to be video cards - at the time having two 768MB nVidia 8800s in SLI was one of the worst cases)}} or whether your motherboard just doesn't support more than, say, 3GB at all.
Both these things can be the reason some people report seeing as little as 2.5GB out of the 4GB they plugged in.
 
 
This problem goes away once you run a 64-bit OS on a 64-bit processor -- though there were some earlier motherboards that still had old-style addressing leftovers and hence some issues.
 
 
Note that the subset of these issues caused purely by limited address space on 32-bit systems could also be alleviated, using PAE:
 
====PAE====
It is very typical to use virtual memory systems.
While the prime upside is probably the isolation of memory, the fact that a memory map is kept for each process also means that on 32-bit, each application has its ''own'' 4GB memory map without interfering with anything else (virtual mapping practice allowing).
 
Which means that while each process could use 4GB at the very best, if the OS could see more memory, it might map distinct 4GBs to each process so that ''collectively'' you can use more than 4GB (or just your full 4GB even with device holes).
 
 
Physical Address Extension is a memory mapping extension (not a hack, as some people think) that does roughly that.
PAE needs specific OS support, but ''doesn't'' need to break the 32-bit model as applications see it.
 
It allowed mapping 32-bit virtual memory into the 36 bit hardware address space, which allows for 64GB {{comment|(though most motherboards had a lower limit)}}
 
 
PAE implies some extra work on each memory operation, but because there's hardware support it only kicked a few percent off memory access speed.
 
 
All newish linux and windows versions support PAE, at least technically.
However:
* The CPU isn't the only thing that accesses memory. Although many descriptions I've read seem kludgey, I easily believe that any device driver that does DMA and is not aware of PAE may break things -- such drivers are broken in that they do not know that the pointers used internally should be limited to 36-bit use.
* PAE was '''disabled''' in WinXP's SP2 to increase stability related to such issues, while server windowses are less likely to have problems since they tend to use more standard hardware and thereby drivers.
 
====Kernel/user split====
{{stub}}
The kernel/user split, mostly relevant on 32-bit OSes, refers to an OS-enforced split of the mappable address space between the kernel and each process.
 
 
It looks like windows by default gives 2GB to both, while (modern) linuces apparently split into 1GB kernel, 3GB application by default {{comment|(which is apparently rather tight on AGP and a few other things)}}.
 
(Note: '3GB for apps' means that any ''single'' process is limited to map 3GB. Multiple processes may sum up to whatever space you have free.)
 
 
In practice you may want to shift the split, particularly in Windows since almost everything that would want >2GB memory runs in user space - mostly databases.
{{comment|The exception is Terminal Services (Remote Desktop), that seems to be kernel space.}}
 
It seems that:
* linuxes tend to allow 1/3, 2/2 and 3/1,
* BSDs allow the split to be set to whatever you want{{verify}}.
* It seems{{verify}} windows can only shift its default 2/2 split to 1GB kernel, 3GB application, using the /3GB boot option {{comment|(the feature is somewhat confusingly called 4GT)}}, but windows applications are normally compiled with the 2/2 assumption and will not be helped unless coded to take advantage of it. Exceptions seem to primarily include database servers.
* You may be able to work around it with a 4G/4G split patch, combined with PAE - with some overhead.
 
===See also===
* http://www.dansdata.com/askdan00015.htm
 
* http://linux-mm.org/HighMemory
* [http://www-128.ibm.com/developerworks/linux/library/l-memmod/ Explore the Linux memory mode]
 
* http://www.spack.org/wiki/LinuxRamLimits
 
* http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory
 
* http://kerneltrap.org/node/2450
 
* http://en.wikipedia.org/wiki/3_GB_barrier
 
<!--
==Motherboards==
 
Integrated graphics used to mean a chip on the motherboard (in or near the northbridge{{verify}}).
This often meant the cheapest option a motherboard manufacturer could find, which is nice and minimal if you have no needs beyond office stuff and a bit of web browsing, but not enough for any shiny graphics worth staring at.
It also ate some of your CPU, main memory.
 
More recent integrated graphics is actually inside the CPU, and seem to be more like entry-level graphics cards and can play more games than the motherboard integrated graphics could.
Also, the implications that there are very few options also means they are much clearer options.
 
 
Gamers will always want a more serious video card - typically even something costing a few dozen bucks will be nicer.
 
 
 
===PCI, PCI-Express===
 
 
PCI Express (PCIe) is an already-common standard designed to replace the older PCI, PCI-X, and AGP standards {{comment|(PCI-X is PCI-eXtended, which was a variant on PCI, largely seen on servers, before there was PCIe. AGP was mostly used for video cards)}}
 
PCI was enough for most low-bandwidth cards, but started creaking for some applications a while ago (video capture, gbit ethernet, and such).
 
 
PCIe means more bandwidth, and is less of a single shared bus and more of a point to point thing (and theoretically more of a switched network thing{{verify}}), and is also symmetric and full duplex (can do the speed in both directions, and at the same time).
 
 
: '''On PCIe speeds and slots'''
 
The slot basically consists of a small chunk of power (and simple management bus stuff), a bit of plastic, and the rest being the data lanes. You can eyeball what sort of slot you have by the size of the lane part.
 
The common slots:
* x1 (250MB/sec on PCIe 1.x) already much faster than PCI, and fast enough for many things.
* x4 (1GB/sec on PCIe 1.x) used by some higher-speed devices (e.g. multi-port GBit controllers), some RAID controllers, and such
* x16 (4GB/sec on PCIe 1.x) is used by video cards, some RAID controllers, and such
* (x2, x8 and x32 exist, but are not seen very often)
 
You can always plug PCIe cards into larger slots.
 
* Speeds can refer both to a slot (its size is largely dictated by its lanes{{verify}}), and the speed that it can do.
** Which isn't always the same. There are e.g. motherboards with x16 slots that only do x8 speeds. ...for example because x16 was faster than most CPU and memory bus speeds at the time of introduction, which would make true x16 a waste of your money.
 
 
PCIe specs actually mention gigatransfers/sec. Given byte-wide lanes, and assuming [http://en.wikipedia.org/wiki/8b/10b_encoding 8b/10b] coding, this means dividing the GT/s figure by 10 to get GByte/s per lane.
The speeds mentioned above are for PCIe 1, which can do 2.5 GT/s per lane.
For comparison:
* v1.x: 250 MByte/s/lane (2.5 GT/s/lane)
* v2.x: 500 MByte/s/lane (5 GT/s/lane)
* v3.0: 1 GByte/s/lane (8 GT/s/lane)
* v4.0: 2 GByte/s/lane (16 GT/s/lane)
 
Note that both device and motherboard need to support the higher PCIe variant to actually use these speeds.
-->
 
 
 
=Some understanding of memory hardware=
 
[https://people.freebsd.org/~lstewart/articles/cpumemory.pdf "What Every Programmer Should Know About Memory"] is a good overview of memory architectures, RAM types, and the reasons bandwidth and access speeds vary.
 
 
 
==RAM types==
 
'''DRAM''' - Dynamic RAM
: lower component count per cell than most (transistor+capacitor mainly), so high-density and cheaper
: yet capacitor leakage means it has to be refreshed regularly, which means a DRAM controller, more complexity, and higher latency than some other types
: (...which can be alleviated and is less of an issue when you have multiple chips)
: this or a variant is typical as main RAM, due to low cost per bit
 
 
'''SDRAM''' - Synchronous DRAM - is mostly a practical design consideration
: ...that of coordinating the DRAM via an external clock signal (previous DRAM was asynchronous, manipulating state as soon as lines changed)
: This allows the interface to that RAM to be a predictable state machine, which allows easier buffering, and easier interleaving of internal banks
: ...and thereby higher data rates (though not necessarily lower latency)
: SDR/DDR:
:: DDR doubled the bus rate by widening the (minimum) units it reads/writes (double that of SDR), which it can do from a single DRAM bank{{verify}}
:: similarly, DDR2 uses units 4x larger than SDR, and DDR3 8x larger than SDR
:: DDR4 uses the same width as DDR3, instead doubling the bus rate by interleaving between banks
:: this is unrelated to latency; it's just that the bus frequency also increased over time (a rough peak-bandwidth calculation is sketched below).
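That rough peak-bandwidth calculation: peak bytes per second is simply transfers per second times the module width. A sketch, using DDR4-3200 on a standard 64-bit (8-byte) wide module as the example (real-world throughput is lower):

<syntaxhighlight lang="c">
#include <stdio.h>

/* Back-of-the-envelope peak bandwidth for one 64-bit (8-byte) wide module:
   peak bytes/s = transfers/s * 8. E.g. DDR4-3200 does 3200 MT/s,
   so roughly 3200e6 * 8 = 25.6 GB/s peak. */
int main(void)
{
    double megatransfers_per_s = 3200.0;          /* DDR4-3200, as an example */
    double bus_width_bytes     = 8.0;             /* 64-bit module            */
    double peak_gb_per_s = megatransfers_per_s * 1e6 * bus_width_bytes / 1e9;
    printf("peak: %.1f GB/s\n", peak_gb_per_s);   /* prints 25.6 */
    return 0;
}
</syntaxhighlight>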
 
 
'''Graphics RAM''' refers to varied specialized RAM types aimed at graphics cards
: Earlier versions would e.g. allow reads and writes (almost) in parallel, making for lower-latency framebuffers
: "GDDR" is a somewhat specialized form of DDR SDRAM
 
 
 
 
'''SRAM''' - Static RAM
: Has a higher component count per cell (6 transistors) than e.g. DRAM
: Retains state as long as power is applied to the chip, no need for refresh, also making it a little lower-latency
: no external controller, so simpler to use
: e.g. used in caches, due to speed, and acceptable cost for lower amounts
 
 
'''PSRAM''' - PseudoStatic RAM
: A tradeoff somewhere between SRAM and DRAM
: in that it's DRAM with built-in refresh, so functionally it's as standalone as SRAM; it is slower, but you get a lot more of it for the same price than SRAM (which tends to come in much smaller capacities)
: (yes, plain DRAM can also have built-in refresh, but that often points at a ''sleep'' mode that retains state without requiring an active DRAM controller)
 
 
 
 
<!--
'''Non-volatile RAM'''
 
The concept of Random Access Memory (RAM) '''only''' tells you that you can access any part of it with similar ease (contrasted with e.g. tape storage, where more distance meant more time, so more storage meant more time).
 
Yet we tend to think about RAM as volatile, as entirely temporary scratchpad, only useful as an intermediate between storage and use.
This is perhaps because the simplest designs (and thereby cheapest per byte) have that property.
For example, DRAM loses its charge and has to be constantly and actively refreshed, DRAM and SRAM and many others lose their state once you remove power.
(There are also exceptions and inbetweeens, like DRAM that doesn't need its own controller and can be told to refresh itself in a low-power mode, acting a whole lot like SRAM).
 
 
Yet there are various designs that are both easily accessible ''and'' keep their state.
 
It's actually a gliding scale of various properties.
We may well call it NVM (non-volatile memory), when grouping a lot of them and don't yet care about further properties - like how often we may read or write or how difficult that is. Say, some variants of EEPROM aren't the easiest to deal with, and consider that Flash, now very common and quite convenient, is a development from EEPROM.
 
When we talk about NVRAM rather than NVM, we are often pointing at more specific designs,
often where we can fairly easily use it and it happens to stick around,
like in FRAM, MRAM, and PRAM, or nvSRAM or even BBSRAM.
 
 
FRAM - Ferroelectric RAM, which resembles DRAM but uses a ferroelectric material,
: easier to access than Flash
: seems to have a read limit rather than a write limit?, but that limit is also something like 1E14 and you are ''unlikely'' to use it so intensely to reach that any time soon.
: so it's great for things like constant logging, which would be terrible for Flash
https://electronics.stackexchange.com/questions/58297/whats-the-catch-with-fram
 
 
 
nvSRAM - SRAM and EEPROM stuck on the same chip.
: https://en.wikipedia.org/wiki/NvSRAM
 
 
BBSRAM - Battery Backed SRAM
: basically just SRAM ''alongside'' a lithium battery
: feels like cheating, but usefully so.
 
-->
 
 
 
===Memory stick types===
{{stub}}
 
 
'''ECC RAM'''
: can detect many (and correct some) hardware errors in RAM
: The rate of bit-flips is low, but they will happen. If your computations or data are very important to you, you want ECC. (A toy single-bit-correction example is sketched below.)
: See also:
:: http://en.wikipedia.org/wiki/ECC_memory
:: {{search|DRAM Errors in the Wild: A Large-Scale Field Study}}
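The toy example mentioned above: a Hamming(7,4) code that corrects any single flipped bit in a 4-bit value. Real ECC DIMMs use a wider SECDED code over 64 data bits (and additionally detect 2-bit errors), but the principle - parity bits whose mismatch pattern points at the flipped bit - is the same:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdint.h>

/* Toy Hamming(7,4) encoder/decoder: corrects any single flipped bit in a
   4-bit nibble. Codeword positions 1..7 are p1 p2 d1 p3 d2 d3 d4. */

static uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
    uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
    /* codeword bit i corresponds to position i+1 */
    return p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    int syndrome = s1 | (s2 << 1) | (s3 << 2);   /* 0 = no error, else the bit position */
    if (syndrome) b[syndrome] ^= 1;              /* correct the single-bit error */
    return b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3);
}

int main(void)
{
    uint8_t data = 0xB;                       /* 1011 */
    uint8_t cw = hamming74_encode(data);
    cw ^= 1 << 4;                             /* simulate a bit flip in position 5 */
    printf("recovered: 0x%X\n", hamming74_decode(cw));  /* prints 0xB again */
    return 0;
}
</syntaxhighlight>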
 
 
'''Registered RAM''' (sometimes '''buffered RAM''') basically places a buffer on the DRAM modules {{comment|(register as in [https://en.wikipedia.org/wiki/Hardware_register hardware register])}}
: offloads some electrical load from the main controller onto these buffers, making it easier for a design to stably connect ''more'' individual memory sticks/chips.
: ...at a small latency hit
: typical in servers, because they can accept more sticks
: Must be supported by the memory controller, which means it is a motherboard design choice to go for registered RAM or not
: pricier (more electronics, fewer units sold)
: because of this correlation with server use, most registered RAM is specifically registered ECC RAM
:: yet there is also unregistered ECC, and registered non-ECC, which can be good options on specific designs of simpler servers and beefy workstations.
: sometimes called RDIMM -- in the same context UDIMM is used to refer to unbuffered
: https://en.wikipedia.org/wiki/Registered_memory
 
'''FB-DIMM''', Fully Buffered DIMM
: same intent as registered RAM - more stable sticks on one controller
: the buffer is now ''between'' stick and controller [https://en.wikipedia.org/wiki/Fully_Buffered_DIMM#Technology] rather than on the stick
: physically different pinout/notching
 
 
'''SO-DIMM''' (Small Outline DIMM)
: Physically more compact. Used in laptops, some networking hardware, some Mini-ITX
 
 
EPP and XMP (Enhanced Performance Profile, Extreme Memory Profiles)
: basically, one-click overclocking for RAM, by storing overclocked timing profiles
: so you can configure faster timings (and V<sub>dimm</sub> and such) according to the modules, rather than by your own trial and error
: normally, memory timing is configured according to a table in the [https://en.wikipedia.org/wiki/Serial_presence_detect SPD], which holds JEDEC-approved ratings that are typically conservative.
: EPP and XMP basically mean running the modules as fast as they can go (and typically at higher voltage)
 
 
 
 
On pin count
: SO-DIMM tends to have a different pin count
: e.g. DDR3 has 240 pins, DDR3 SO-DIMM has 204
: e.g. DDR4 has 288 pins, DDR4 SO-DIMM has 260
: Registered RAM has the same pin count
: ECC RAM has the same pin count
 
 
In any case, the type of memory must be supported by the memory controller
: DDR2/3/4 - physically won't fit
: Note that while some controllers (e.g. those in CPUs) support two generations, a motherboard will typically have just one type of memory socket
: registered or not
: ECC or not
 
Historically, RAM controllers were a thing on the motherboard near the CPU, while there are now various cases where the controller is on the CPU.
 
==More on DRAM versus SRAM==
{{stub}}
 
<!--
'''Dynamic RAM (DRAM)''' cells are a transistor and capacitor, much simpler than various other types of RAM.
 
The transistor controls
: writes (set level on the data line, then raise cell access line long enough for charge/discharge to that level)
: and reads (raise the access line for discharge into the data line, which has something that senses whether there was charge).
 
That means reads are slowish, but more importantly, reads are destructive of the row it is in, and there is a mechanism to store it back.
 
 
Separately, capacitors slowly leak charge anyway (related to closeby cells, related to their bulk addressing, and note that the higher the memory density, the smaller the capacitor so the sooner this all happens), so DRAM only makes sense with refresh: when there is something going through reading every cell and writing it back, ''just'' to keep the state over time.
 
The DRAM controller will refresh each DRAM row within (typically) 64ms, and there are on the order of thousands to tens of thousands of them in a DRAM chip.
 
 
Yes, this means you randomly incur some extra latency.
 
Larger chips effectively have longer refresh overhead.
 
Each chip is slower than ideal, which can be made irrelevant by having the same amount of RAM in more chips on a memory stick. (This seems to also be part of why servers often have more slots{{verify}})
 
 
It also means DRAM will require more power than most others, even when it's not being used.
 
 
With all these footnotes, DRAM seems clunky, so why use it?
 
Mainly because it's rather cheaper per bit (even with economy of scale in production),
and as mentioned, you can alleviate the performance part fairly easily.
 
 
 
 
The first thing you'd compare DRAM to is often '''SRAM (Static RAM)''', or some variant of it.
 
SRAM cells are more complex per bit, but don't need refresh,
are fundamentally lower-latency than DRAM, and take less power when idle.
 
(with some variation; lower-speed SRAM can be low power, whereas at high speeds and heavy use, power can be comparable to DRAM)
 
 
The main downside is that due to their complexity, they are lower density, cost more silicon (and therefore money) per bit.
 
There are a lot of high-speed cases, or devices, where a little SRAM makes a lot of sense, like network switches,
and also L1, L2, and L3 caches in your computer.
 
 
SRAM is electrically easier to access (also means you need less of a separate controller),
so simple microcontrollers may prefer it, also because it's easier to embed on the same IC.
 
Since SRAM uses noticeably more silicon than DRAM per cell, SRAM is often under a few hundred kilobytes - in part because you'd probably use SRAM for important bits, alongside DRAM for bulkier storage.
 
 
----
 
'''Pseudostatic RAM (PSRAM, a.k.a. PSDRAM)''' are ICs that contains both DRAM and a controller, so has DRAM speeds, but are as easy to use as SRAM, and a price somewhere inbetween.
 
 
There are even variants that are basically DRAM with an SRAM cache in front
so that well controlled access patterns can be quite fast.
 
----
 
More DRAM notes:
 
For a few reasons (including that there are a ''lot'' of bits in the address, to save dozens of pins as well as silicon on internal demultiplexing), DRAM is typically laid out as a grid, and the address is essentially sent in two parts, the row and the column, sent one after the other.
 
This is what RAS and CAS are about - the first is a strobe that signals the row address can be used, the second that the column can be.
 
And, because capacitors are not instant, there needs to be some time between RAS and CAS, and between CAS and data coming out. This, and other details (e.g. precharge) are a property of the particular hardware, and should be adhered to to be used reliably.
 
Setting these parameters by hand would be annoying, so on DDR DRAM sticks there is a small chip[https://en.wikipedia.org/wiki/Serial_presence_detect] that tells the BIOS the timing options.
 
----
 
DRAM is also so dense that it has led to some electrical issues, e.g. the [https://en.wikipedia.org/wiki/Row_hammer row hammer] exploit.
 
----
 
 
Because you still spend quite bit of time on addressing, before the somewhat-faster readout of data,
a lot of DRAM systems do prefetch/burst (what you'd call readahead in disks).
 
That is, instead of fetching a cell, it fetches (burst_length*bus_width), with burst_length apparently linked to DDR type, but 64 bytes for DDR3 and DDR4. (also because that's a common CPU cache line size)
 
This is essentially a forced locality assumption, but it's relatively cheap and frequently useful.
 
 
 
 
 
"RAS Mode"
: lockstep mode, 1:1 ratio to DRAM clock
:: more reliable
 
: independent channel mode, a.k.a. performance mode, 2:1 to DRAM clock
:: more throughput
:: also allows more total DIMMs (if your motherboard is populated with them)
: mirror - seems to actually refer to memory mirroring.
 
Note this is about the channels, not the DRAM.
 
https://www.dell.com/support/article/nl/nl/nlbsdt1/sln155709/memory-modes-in-dual-processor-11th-generation-poweredge-servers?lang=en#Optimizer
 
https://software.intel.com/en-us/blogs/2014/07/11/independent-channel-vs-lockstep-mode-drive-you-memory-faster-or-safer
 
 
 
 
 
 
In PCs, the evolution from SDR SDRAM to DDR SDRAM to DDR2 SDRAM to DDR3 SDRAM is a fairly simple one.
 
SDR ():
* single pumped (one transfer per clocktick)
* 64-bit bus
* speed is  8 bytes per transfer * memory bus rate
 
DDR (1998):
* double pumped (two transfers per clocktick, using both the rising and falling edge)
* 64-bit bus
* speed is  8 bytes per transfer * 2 * memory bus rate
 
DDR2 (~2003):
* double pumped
* 64-bit bus {{verify}}
* effective bus to memory is clocked at twice the memory speed
* No latency reduction over DDR (at the same speed) {{verify}}
* speed is  8 bytes per transfer * 2 * 2 * memory bus rate
 
DDR3 (~2007):
* double pumped
* 64-bit bus {{verify}}
* effective bus to memory is clocked at four times the memory speed
* No latency reduction over DDR2 (at the same speed) {{verify}}
* speed is  8 bytes per transfer * 2 * 4 * memory bus rate
 
DDR4 (~2014)
* double pumped
 
DDR5 (~2020)
*
 
Each generation also lowers voltage and thereby power (per byte).
 
 
(Note: Quad pumping exists, but is only really used in CPUs)
 
 
The point of clocking the memory bus higher than the speed of individual memory cells
is that as long as you are accessing data from two distinctly accessed cells,
you can send both on the faster external (memory) bus. {{verify}}
 
It won't be twice, but depending on access patterns might sometimes get close{{verify}}.
 
 
Dual-channel memory is different yet again - it refers to using an additional 64-bit bus to memory in addition to the first 64 bits, so that you can theoretically transfer at twice the speed.
The effect this has on everyday usage depends a lot on what that use is, though.
It seems that even in the average case the improvement is not too noticeable.
 
 
 
so four bits of data can be transferred per memory cell cycle. Thus, without changing the memory cells themselves, DDR2 can effectively operate at twice the data rate of DDR.
 
 
 
Within a type (within SDR, within DDR, within DDR2, etc.), the different speeds do not point to a different design. Like with CPUs, it just means that the memory will work at that speed. Cheap memory may fail if clocked even just a little higher, while much more tolerant memory also exists, which is interesting for overclockers.
 
Note that the bus speed a particular piece of memory will work under depends on how c
 
 
 
 
 
Transfers per clocktick:
* 1 for SDR/basic SDRAM
* 2 for DDR SDRAM
* 4 for DDR2 SDRAM
* 8 for DDR3 SDRAM
* DDR4
* DDR5
 
 
 
SDRAM was available as:
66MHz
100MHZ
133 MHz
 
DDR used to have some commonly used aliases, e.g.
alias    standard name  speed
PC-1600  DDR-200        100MHz, 200Mtransfers/s, peak 1.6GB/s
PC-2100  DDR-266        133MHz, 266Mtransfers/s, peak 2.1GB/s
PC-2700  DDR-333        166MHz, 333Mtransfers/s, peak 2.7GB/s
PC-3200  DDR-400        200MHz, 400Mtransfers/s, peak 3.2GB/s
 
As of right now (late 2008), DDR3 is not worth it money/performance-wise, but DDR2 is interesting over DDR.
 
PC2-3200  DDR2-400        100MHz, 400Mtransfers/s, peak 3.2GB/s
PC2-4200  DDR2-533        133MHz, 533Mtransfers/s, peak 4.2GB/s
PC2-5400  DDR2-667        166MHz, 667Mtransfers/s, peak 5.4GB/s
PC2-6400  DDR2-800        200MHz, 800Mtransfers/s, peak 6.4GB/s
 
-->
 
==On ECC==
<!--
 
Like disks, RAM has an error rate, for a few reasons [http://arstechnica.com/business/2009/10/dram-study-turns-assumptions-about-errors-upside-down/] [http://www.cnet.com/news/google-computer-memory-flakier-than-expected/] [http://www.zdnet.com/blog/storage/dram-error-rates-nightmare-on-dimm-street/638].
 
It's tiny, but it's there, and there's a fix via error correction methods.
These can typically fix the error if it's just one bit. {{comment|(Two bits at a time, which can be detected but not fixed, are rare unless you've got a faulty stick, or are overclocking it to the point of instability)}}
 
 
On servers there is more interest in not having a weak link that can flip some bits without you noticing,
and send a bad version of the data to disk, or generating them during lots of hard calculation.
So it's typically used in storage servers, clusters, and possibly everything important enough to
be housed in a server room, in particular when its admin want to sleep a little less nervously.
 
On workstations you may care less.
There is a decent chance errors occur in unused areas, in program code, in programs doing nothing of long-term consequence,
or in data that is read but will not be written to disk. Programs may crash but the system may not. Video may merely glitch.
 
In computing clusters, it may be entirely viable (and a good idea anyway) to double-check results
and just redo the tiny part of a job that doesn't make sense.
 
 
For devices that store data you really care about, consider ECC. The tradeoffs are actually more complex,
in that there are other parts of the whole that can make mistakes for you, so this ECC is just about removing ''one'' weak link,
while you are still leaving others.
 
In theory, whenever you're ''not'' altering an authoritative store, ECC is less important.
 
 
 
Intel has a weird separation in that ECC support is ''disabled'' in consumer CPUs,
apparently to entice businesses to buy Xeons with their ECC combination.
 
If you're setting up a storage server at home,
or otherwise care about a few hundred dollar difference,
then an ECC-capable motherboard+Xeon+ECC RAM is a pricy combination,
so it's common enough to go AMD instead,
simply because it's more flexible in its combinations.
 
 
-->
<!--
 
 
 
 
-->
 
==Buffered/registered RAM==
<!--
This is mainly a detail of motherboard (and CPU) design.
 
Registered RAM places less electrical load on the controller than unbuffered RAM,
meaning you can stuff more slots/sticks on the same motherboard.
 
 
The buffer/register refers to the part stuck inbetween,
which also makes it slightly slower.
 
Buffered RAM is mostly interesting for servers that ''must'' have a lot of RAM.
 
http://en.wikipedia.org/wiki/Registered_memory
 
 
-->
 
==EPROM, EEPROM, and variants==
 
PROM is Programmable ROM
: can be written exactly once
 
EPROM is Erasable Programmable ROM.
: often implies UV-EPROM, erased with UV light shone through a quartz window.
 
EEPROM's extra E means Electrically Erasable
: meaning erasing is now an electrical command, rather than UV exposure.
: early EEPROM read, wrote, and erased {{verify}} a single byte at a time. Modern EEPROM can work in larger chunks.
: you only get a limited amount of erases (much like Flash. Flash is arguably just an evolution of EEPROM)
 
 
<!--
 
EEPROMs tend to erase in chunks.
 
There is typically no separate erase, erase happens transparently on writes - it reads a page into its tiny RAM, erases, and writes the whole thing back.
 
This means it has a [[write hole]],
and erases faster than you may think.
 
That said, the erase count is relatively high for something you may not consider to be continuously alterable storage.
 
 
 
-->
 
==Flash memory (intro)==
{{stub}}
<!--
 
Flash is a type of EEPROM, and a refinement on previous 'plain' EEPROM.
The name came from marketing, helping to distinguish it as its own thing with its own properties.
 
Like EEPROM, Flash is non-volatile, erases somewhat slowly, and has a limited number of erase cycles.
 
 
One difference is making it erasable in chunks (smaller than erase-fully variants, larger than erase-bytes variants),
a tradeoff that helps speed, cost, and use for random-storage needs.
 
 
Simpler memory cards, and simpler USB sticks, have one flash chip and a simple controller, which is the cheapest setup and why they don't tend to break 10MB/s,
and don't have enough wear leveling to last very long.
 
SSDs go faster
partly because they parallelize to more chips (RAID-like layout), and
partly because of an extra layer of management that (in most practical use) hides Flash's relatively slow erase speed (by doing it at other times, not when it's needed, a plan that usually works but may not under heavy load).
 
-->
 
<!--
 
There are two types of Flash, NOR and NAND, named for the cells resembling (and working like) classical logic gates.
 
Very roughly,
NOR is faster but more expensive so has a few specialist uses,
NAND is denser, slower, and takes less power, is typically more useful and cheaper for bulk storage.
 
 
In terms of storage area, Flash is denser than platter - but that's only true if we don't count the size of IC packages.
In practice the physical overhead of both makes them comparable.
 
 
 
In comparison:
* Reading is three orders of magnitude slower than DRAM
* Writing is four orders of magnitude slower than DRAM
 
 
 
'''Flash data retention (active use)'''
 
Active use of cells wears the semiconductor, and lowers the ability to stably retain charge.
This is expressed in the amount of erases it will take - which differ between SLC, TLC, MLC.
 
It's a bad idea for SSDs to just use a cell until you can no longer read it, because that's just
storage failure. As such, they try to be pessimistic/conservative.
 
USB sticks and memory cards ''are'' often a bit tralala about it.
 
 
 
'''Flash data retention (idle shelf time)'''
 
State is kept as charge, so you may ask "how non-volatile is non-volatile?"
 
The short answer is that flash is not meant for archival purposes.
 
Flash producers often give a spec of ten years.
But that's a rated value, and with some assumptions.
Including extrapolation, as there are no long-term studies on NAND{{verify}}.
 
This also differs per type (TLC, MLC, SLC) because closer voltage levels imply
the same decay makes more difference.
 
 
There are some people that say just powering up flash will make its controller refresh the data.
AFAICT this is generally not true - you should ''never'' assume this on memory cards and USB sticks.
It ''may'' be true for some SSDs{{verify}}, but unless you know your model does, don't bet your data on it.
 
In other words, if you want a fresh copy of your data, read it all off, and write it back.
{{comment|(...just as you should be doing on platter disks, and are also not doing)}}.
 
-->
 
==PRAM==
 
<!--
https://en.wikipedia.org/wiki/Phase-change_memory
 
-->
