Computer data storage - SSD notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Pricing of SSDs is for a large part the wafer manufacturing, which means that die shrinks will cut prices (but moving to a new manufacturing process takes time and money, both of which mean that the effect on prices consumers see is only gradual). It also means any other trick to store more on the same silicon is interesting. This is where the following section comes in:


You'll often see the terms SLC and MLC. More specifically:

  • SLC (single-level cell): 1 bit per cell
  • MLC (multi-level cell): 2 bits per cell
  • TLC (triple-level cell): 3 bits per cell

This is not actually a physical difference in the cells, it's about the way they are read out and written. Any specific hardware tends to use only one, though.

The obvious upside to more bits per cell is more storage per area, and therefore cheaper storage.

There are also a few downsides to the higher density. Storing reliably at a higher density takes more care, so adds latency (and lowers throughput for the same number of flash chips, but that's rarely a problem because there are typically enough chips side by side).

Perhaps most importantly, the more complex access means cells wear faster. Approximate numbers:

  • SLC should last at least 100000 erases (verify) (which seems to mean "it'll probably fail in another way first")
  • MLC should last at least 10000, though there are variants rated at 5000 or so (verify)
  • TLC should last at least 3000

...though there is variation with each type, manufacturing requirements, etc.

For consumers, MLC is currently a nice tradeoff: a good price per size, and a lifetime good enough for everyday computer use.

SLCs last longer, and can work out as cheaper over time for heavier workloads. This and their lower latency makes them interesting for servers, e.g. database needs.

TLC gives us cheap SSDs, though the price-per-GB improvement for TLC seems less than the bit density suggests, and it wears faster, so it isn't always much more interesting than MLC. It can be useful for read-heavy workloads, say, download sites.

So it's somewhat (...but not quite) like: "Low cost, good performance, high capacity, good lifetime: choose three"

On speed, and the reasons it varies significantly between uses

Compared to a decent HDD:

  • random write: For cheaper drives, this is on the order of HDD speeds, sometimes worse. On good drives it's slightly better.
  • random read: faster
  • sequential read: faster
  • sequential write: similar
  • speed over time: degrades as well (but for different reasons; see below)
  • application launch: faster, because this is mostly (and relatively random) reads
  • application tests: slightly faster, but often hard to tell because there's little IO most of the time. Running many things at the same time will be faster, in that IO won't degrade if it's mostly reads

Some ballpark figures, for MLC

  • Random write
    • Less than 1 MB/s on some cheap MLC drives, down to 0.05 or so in the worst cases
    • 1..15 MB/s on moderate MLC
    • up to 20 or 30 or 40 on good MLC drives, and on SLC
    • HDDs: ~1MB/s.
  • Random read
    • 15..50 MB/s (latency often 0.2 to 0.7 ms)
    • HDDs: <1 MB/s
  • Sequential write
    • 70..180 MB/s (some down to 45)
    • HDDs: ~100 MB/s
  • Sequential read
    • 100..250 MB/s
    • HDDs: ~100 MB/s
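As a sanity check on those random-read figures: latency and throughput are roughly two views of the same thing. A back-of-envelope sketch, where the 4KB request size and the one-request-at-a-time assumption are mine (real drives reach the higher end of the range with command queueing):

```python
# Rough conversion from per-request latency to random-read throughput,
# assuming 4KB requests and one outstanding request at a time.

def random_read_mb_per_s(latency_ms, request_kb=4):
    """Throughput if each request must wait for the previous one."""
    requests_per_s = 1000.0 / latency_ms
    return requests_per_s * request_kb / 1024.0

# SSD at 0.25 ms per 4KB read:
print(random_read_mb_per_s(0.25))   # ~15.6 MB/s
# HDD at ~10 ms of seek plus rotation per read:
print(random_read_mb_per_s(10))     # ~0.4 MB/s, i.e. the "<1 MB/s" above
```

Which is why the SSD/HDD gap is largest exactly for random reads: the latency ratio carries over directly.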

Maximum sequential throughput benchmarks are stupid. There are few real-world use patterns that are sequential and fast enough to use all that speed and do something useful in the process. Yes, some large-scale well-optimized number crunching. Little server use. Not your desktop.

Also, how drives manage their data affects what it is tuned for. Those with high throughput figures may actually do worse for your real, random workloads - and reviews that report only throughput are indirectly asking for that, because we buy these things based on something we can easily grasp.

Most everyday OS and application work is random reads and random writes. (In somewhat consistent and somewhat predictable areas, but random for most real purposes)

SSD random read latency is simple: It's better than HDD's. It's a major reason applications launch faster, why applications running in parallel often bother each other less (depending on the application), and why load time in games tends to be lower (that one can also depend a lot on game-specific tuning for HDDs).

SSD random write latency varies. It depends on a lot of lower-level details, how clever the OS is, and what applications actually do. In bad cases it can be worse than HDD's. Since random writes are fairly common in many workloads, random write performance is one of the most significant performance metrics of an SSD, unless you are buying it for a specific use where you know it isn't (e.g. a cache drive). (There were some early models, and some current cheaper ones, which had very high random write latency, and as a result fairly miserable random write throughput that any HDD could beat.)

More details

A page is a smallish set of cells, the smallest unit that can be written, and is currently often 4KB (larger on a few models).

Pages are grouped in blocks. A block is the smallest unit that can be erased, often 64 or 128 pages, so currently on the order of 256KB or 512KB or so.
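Those two sizes are what make small writes expensive. A minimal sketch of the worst case, assuming the page and block sizes just mentioned and a controller that naively rewrites the whole block (real controllers coalesce writes into free pages, so actual amplification is far lower):

```python
# Worst-case write amplification when a small write forces
# a read-modify-erase-write of an entire block.

PAGE_KB = 4
PAGES_PER_BLOCK = 128
BLOCK_KB = PAGE_KB * PAGES_PER_BLOCK   # 512 KB per block

def worst_case_amplification(write_kb):
    # the whole block gets erased and rewritten for this one write
    return BLOCK_KB / write_kb

print(worst_case_amplification(4))     # 128: one page costs a whole block
print(worst_case_amplification(512))   # 1.0: block-sized writes are ideal
```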

Pages, blocks, and write amplification
Spare blocks, overprovisioning
Spare blocks, wear, and speed degradation

TODO: figure out in enough detail to actually write about it



  • (...assuming we only worry about flash erase cycles...)
  • For light to moderate use, an SSD should last longer than the average platter drive
  • pathological cases will wear them faster than platter drives

The lifetime of flash is determined primarily by how many times you can erase it, combined with how often you actually do.

The number of erases varies with production, but as a very rough estimate:

SLC gives on the order of 100000 erases,
MLC often 10000 erases,
(eMLC, enterprise MLC, around 30000),
TLC more like 4000 erases

So how often do you need to erase?

This is primarily related to how often/fast you need to write data.
Plus write amplification.

There is a low factor difference between various real-life uses, and more than an order of magnitude between extreme cases.

Say, TLC is certainly worse when you process terabyte datasets every day, yet if you want a write-rarely-read-heavy cache it's awesome value for money.

For lifetime in terms of, well, time, you need some assumptions.

Let's make an optimistic calculation:

  • you have 10000 disk-erases (you chose MLC, not the cheapest, not the fanciest)
  • you write 10GB/day (the order of magnitude for moderate everyday computer use)
  • ...of large blocks of data, so no write amplification (so 10GB writes is approx. 10GB of erase)
  • perfectly spread wear (e.g. a scratch disk that gets cleared daily)

Ten thousand erase cycles doesn't sound like a lot, but keep in mind it's relative to the size:

Writing 10GB to a 10GB drive means one disk-erase every day; 10k erases at one per day is 27 years,
writing 10GB to a 100GB drive means one disk-erase every ten days; 10k erases at one per ten days is 270 years
writing 10GB to a 1TB drive means ~three overall erases per year, and so on.
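The arithmetic above as a sketch (the function and its parameters just restate the listed assumptions: spread wear, no amplification; nothing drive-specific):

```python
# Optimistic lifetime estimate: perfectly spread wear,
# no write amplification.

def lifetime_years(drive_gb, erase_cycles, written_gb_per_day):
    days_per_full_erase = drive_gb / written_gb_per_day
    return erase_cycles * days_per_full_erase / 365.0

for size_gb in (10, 100, 1000):
    print(size_gb, round(lifetime_years(size_gb, 10000, 10)))
# 10 GB  -> ~27 years
# 100 GB -> ~274 years
# 1 TB   -> ~2740 years (i.e. roughly three full erases per year)
```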

(keep in mind that numbers above a dozen years start to be less meaningful, because it's likely that the electronics around the storage breaks first)

DWPD (Drive Writes Per Day) was introduced as another way of thinking in this way. One DWPD means you write the entire drive's size worth of data per day.

For example, if a drive mentions 3 DWPD and has a 4-year warranty, that's around ~4000 erases, and it's probably TLC, with the assumption that writes are 1:1 with erases.(verify)

TBW, Total Bytes Written is arguably slightly more useful, as it entangles writes, erases, and an assumed write amplification factor (typically unmentioned, in which case you can assume it is optimistic everyday-use).

For example, a TBW of 64TB on a 120GB drive, with an assumed amplification of 4, works out to approximately 2000 erase cycles.(verify)
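Both conversions can be sketched like so (assuming writes map 1:1 to erases for the DWPD case, and applying the stated amplification for the TBW case; the function names are mine):

```python
# Converting warranty-style figures back to approximate erase cycles.

def dwpd_to_erases(dwpd, warranty_years):
    # one DWPD = one full-drive write (~one erase) per day
    return dwpd * warranty_years * 365

def tbw_to_erases(tbw_tb, drive_gb, amplification=1.0):
    drive_writes = tbw_tb * 1024.0 / drive_gb
    return drive_writes * amplification

print(dwpd_to_erases(3, 4))        # 4380, the "~4000 erases" above
print(tbw_to_erases(64, 120, 4))   # ~2185, the "~2000 erase cycles" above
```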

I see both these terms as being as obfuscating as they are convenient. YMMV.

On unequal wear, and amplification factor

The previous bit sounds optimistic, and is. In reality you may find it wears factors faster.


  • a portion of your writes are small
e.g. writing a few lines into multiple logs, as soon as possible (modern loggers may choose to flush more slowly to preserve SSD)
OS buffers tend to flush based on timeout, so if you log at all you get a minimum rate of writes
but for debug around kernel panics and such you may want to flush faster
  • various writes go to two places (e.g. filesystem data and metadata are often in different blocks)
  • writes smaller than a block tend to still mean a whole-block erase
  • people may write more data than that ~10GB/day casual-workstation-use figure
intense use can be much heavier. Consider serious number crunching that works in many "take 100GB, process it, write 100GB" steps. That can be terabytes per day. (For such uses you may be better off with platter RAID. Or at least SLC. Datacenters tend to go for SLC.)
  • Backup/archiving onto SSDs is easily one erase per backup (which is actually not that bad given its merits)
  • some SSDs compress data (this can be a positive effect)

The combined effect is often expressed as amplification factor, with the meaning of 'it may wear this factor faster', and under heavier use you should assume it's a low factor higher than ideal.

So if you like careful estimates, then instead of those 27 and 270 years, assume maybe 4 and 40.

On wear leveling

Raw flash storage has no wear leveling.

In a pathological case you will have one sector wear out hundreds of times faster, and the device becomes useless as a whole even if all the other cells have plenty of life left in them.

Many SSDs will do dynamic wear leveling, meaning that what is the same location to the OS is backed by a different area on each write. Roughly, this means all actively written areas wear equally quickly.

There is also static wear leveling, which adds to the previous section by also moving the rarely-written sectors around.

For an intuition of why, consider that if 90% of your disk is full with files that were written there once and since then are only ever read (most of the OS, music, video, etc.), then 10% of your disk sees all the writes, so wears ten times faster and will be the reason you need to retire the whole drive, even if the other 90% has seen just one erase ever. Dynamic wear leveling applies only to actively written areas, so has no effect on this.

Static wear leveling, while increasing the amount of writes overall, helps spread that around. It is also a little slower overall, but often worth it.


  • wear leveling doesn't reduce amplification factor
but it does postpone the time at which single blocks start having to be retired due to that amplification factor
(there are spare blocks, but you'll run out of them just as quickly as you ran out of the non-spares)
  • dynamic and static wear leveling may be implemented not over the entire SSD, but within specific parts
implying this leveling is not globally spread - but even then is decent
  • some filesystems are designed to do wear leveling on top of storage that doesn't
  • Presumably there exist controllers (you'd then mostly see them used in cheap knockoff USB sticks / memory cards) that do not add any wear leveling -- yet it's also presumable that most things with a brand on them have at least halfway decent wear leveling, to not become known as the ones that are significantly shittier.
In various cases it's still an unknown to consumers, though.
For example with SD cards it's not in the spec, so even assuming wear leveling is present you won't know its quality - or failure modes. You should probably assume all SD cards have a write-hole problem.
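To illustrate the dynamic case, here's a toy flash translation layer that redirects each logical write to the least-worn free block. This is a deliberately simplified sketch: real FTLs involve mapping tables, garbage collection, and the static leveling described above.

```python
# Toy dynamic wear leveling: the OS keeps writing "the same" logical
# block, but each write lands on the least-worn free physical block,
# so erase counts stay roughly even across the device.

class ToyFTL:
    def __init__(self, n_blocks):
        self.erases = [0] * n_blocks    # per-physical-block erase count
        self.mapping = {}               # logical block -> physical block

    def write(self, logical):
        in_use = set(self.mapping.values())
        free = [b for b in range(len(self.erases)) if b not in in_use]
        target = min(free, key=lambda b: self.erases[b])  # least-worn
        self.erases[target] += 1        # rewriting implies an erase here
        self.mapping[logical] = target

ftl = ToyFTL(8)
for _ in range(1000):
    ftl.write(0)                        # hammer one logical location
print(max(ftl.erases) - min(ftl.erases))   # wear stays nearly even
```

Without the remapping, block 0 alone would have eaten all 1000 erases, which is the pathological case described above.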

How do SSDs fail

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Reasons behind failure:

  • plain old component failure -
seems to be a decent portion of real-world failures
Note that if the flash chips are unaffected, a recovery lab may be able to transplant them for recovery.
Like any component failure: assume it may be dead, so never have just one copy. Like any storage.
  • Unrecoverable bit errors - basically, every so often a bit can get stored wrong.
Because MLC and TLC use more voltage levels in the same cell, they have less leeway and are more likely to have bit problems when a cell degrades.
This is considered in MLC/TLC design, in that they use more cells for ECC (error correction) than fewer-level-per-cell variants -- yet it's a little more complex than just that.
  • unreliable power, e.g. an overloaded power supply or a power surge, has been known to corrupt data.
  • Firmware bugs.
Happened on some early models, though bugs still pop up.
  • Block erase failure - the predictable one.
After a while, a block cannot be erased/programmed. The firmware detects this, reallocates the data to a spare block, and marks the old one as dead.
This makes the real question "what happens after all spare blocks are gone?", which is up to the firmware (and its interaction with the controller).

It's long been "I heard that"-d that SSDs are designed to become read-only on failure and not lose data.

In reality, some fail gracefully and are easily recovered, some only stay readable in the same reboot, some brick completely.

So if you care about your data, you should absolutely never assume this.

That said, most of them give good early warning, via tooling, and sometimes also by becoming noticeably slower (...which you'll notice in workstations more than servers).


On power use

SSDs tend to use between 0.5 and 1 Watt when idle, and between 1 and 4 Watts when active.

Which is about the same as a 2.5" hard drive, so don't expect SSDs to make your laptop battery last longer.

There are even some cases where SSDs are worse. A decent chunk of power use on HDDs is moving heads about, so sequential operations are not their worst case. SSDs don't have such a difference.

(Also, power per gigabyte is typically worse, at least for the smaller SSDs)

SSDs have not focused on power saving yet, so things may change.

See also

Most of my information came from Anand.