Computer data storage - SSD notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Price

Pricing of SSDs is for a large part determined by wafer manufacturing cost, which means that die shrinks will cut prices (but moving to a new manufacturing process takes time and money, both of which mean that the effect on prices consumers see is only gradual).

...so any other trick to store more on the same silicon is interesting. This is where the following section comes in:

SLC, MLC, TLC

You'll often see the terms SLC and MLC. More specifically:

  • SLC (single-level cell): 1 bit per cell
  • MLC (multi-level cell): 2 bits per cell
  • TLC (triple-level cell): 3 bits per cell

This is not actually a physical difference in the cells; it's about the way they are read out and written. Any specific piece of hardware tends to use only one, though.
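
As a rough illustration of why more bits per cell take more care to read and write (a sketch; the two-to-the-n-levels relation is general background, not something these notes spell out):

  # Sketch: n bits per cell means the cell must distinguish 2**n
  # distinct charge levels when reading and writing.
  for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3)]:
      levels = 2 ** bits
      print(f"{name}: {bits} bit(s) per cell -> {levels} levels to tell apart")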


The obvious upside to more bits per cell is more storage per area, and therefore cheaper storage.


There are also a few downsides to the higher density. Storing reliably at a higher density takes more care, so more latency (and less throughput when using the same number of flash chips, but that's rarely a problem because there are typically enough chips side by side).

Perhaps most importantly, the more involved means of access means cells wear out faster. Approximate numbers:

  • SLC should last at least 100000 erases (verify) (which seems to mean "it'll probably fail in another way first")
  • MLC should last at least 10000, though there are variants rated at 5000 or so (verify)
  • TLC should last at least 3000

...though there is variation with each type, manufacturing requirements, etc.


For consumers, MLC is currently a nice tradeoff: a good price per size, and a lifetime good enough for everyday computer use.

SLCs last longer, and can work out cheaper over time for heavier workloads. This and their lower latency make them interesting for servers, e.g. database needs.

TLC gives us cheap SSDs, though the price-per-GB improvement seems smaller than the bit density suggests, and it wears faster, so it isn't always much more interesting than MLC. It can be useful for read-heavy workloads, say, download sites.


So it's somewhat (...but not quite) like: "Low cost, good performance, high capacity, good lifetime: choose three"

On speed, and the reasons it varies significantly between uses

Compared to a decent HDD:

  • random write: For cheaper drives, this is on the order of HDD speeds, sometimes worse. On good drives it's clearly better.
  • random read: faster
  • sequential read: faster
  • sequential write: similar
  • speed over time: degrades as well (but for different reasons; see below)
  • application launch: faster, because this is mostly (relatively random) reads
  • application tests: slightly faster, but often hard to tell because there's little IO most of the time. Running many things at the same time will be faster, in that IO won't degrade as much if it's mostly reads


Some ballpark figures, for MLC

  • Random write
    • Less than 1 MB/s on some cheap MLC drives, down to 0.05 or so in the worst cases
    • 1..15 MB/s on moderate MLC
    • up to 20 or 30 or 40 on good MLC drives, and on SLC
    • HDDs: ~1MB/s.
  • Random read
    • 15..50 MB/s (latency often 0.2 to 0.7 ms)
    • HDDs: <1 MB/s
  • Sequential write
    • 70..180 MB/s (some down to 45)
    • HDDs: ~100 MB/s
  • Sequential read
    • 100..250 MB/s
    • HDDs: ~100 MB/s


Maximum sequential throughput benchmarks are stupid. There are few real-world use patterns that are sequential and fast enough to use all that speed and do something useful in the process. Yes, some large-scale, well-optimized number crunching. Little server use. Not your desktop.

Also, how drives manage their data affects what they are tuned for. Those with high throughput figures may actually do worse for your real, random workloads - and reviews that report only throughput are indirectly asking for that, because we buy these things based on something we can easily grasp.


Most everyday OS and application work is random reads and random writes. (In somewhat consistent and somewhat predictable areas, but random for most real purposes)

SSD random read latency is simple: It's better than HDD's. It's a major reason applications launch faster, why applications running in parallel often bother each other less (depending on the application), and why load time in games tends to be lower (that one can also depend a lot on game-specific tuning for HDDs).

SSD random write latency varies. It depends on a lot of lower-level details, how clever the OS is, and what applications actually do. In bad cases it can be worse than an HDD's. Since random writes are fairly common in many workloads, random write performance is one of the most significant performance metrics of an SSD, unless you are buying it for a specific use where you know it isn't (e.g. a cache drive). (There were some early models, and some current cheaper ones, which had very high random write latency, and as a result, fairly miserable random write throughput that any HDD could beat.)

More details

A page is a smallish set of cells, the smallest unit that can be written, and is currently often 4KB (larger on a few models).

Pages are grouped in blocks. A block is the smallest unit that can be erased, often 64 or 128 pages, so currently on the order of 256KB or 512KB or so.
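
As a rough illustration of write amplification in the worst case (a sketch using the page/block sizes above; it ignores the caching, buffering, and remapping that real firmware does): changing a small amount of data in an already-full block can force the drive to erase and reprogram the whole block.

  # Sketch: worst-case write amplification when a small write forces
  # a whole-block erase-and-rewrite (ignores real firmware's remapping tricks).
  PAGE_KB = 4
  BLOCK_KB = 512                     # e.g. 128 pages of 4KB

  def worst_case_amplification(write_kb):
      # Flash erased per block to change only write_kb of user data.
      return BLOCK_KB / write_kb

  print(worst_case_amplification(PAGE_KB))    # ~128x for a lone 4KB write
  print(worst_case_amplification(BLOCK_KB))   # ~1x for block-sized writes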


Pages, blocks, and write amplification
Spare blocks, overprovisioning
Spare blocks, wear, and speed degradation

TODO: figure out in enough detail to actually write about it


Lifetime

tl;dr:

  • ...assuming we only worry about flash erase cycles...
  • For light to moderate use, an SSD should last as long as or longer than the average platter drive
  • pathological cases will wear them faster than platter drives



The lifetime of flash is determined primarily by how many times you can erase it, combined with how often you need to.

The number of erases varies with production, but as a very rough estimate:

  • SLC gives on the order of 100000 erases
  • eMLC (enterprise MLC) around 30000
  • MLC often 10000 erases
  • TLC more like 4000 erases (varies a bunch - order of magnitude!)

How often you need to erase is harder to estimate. It is primarily related to how often, and how fast, you need to write data. Plus you need to consider write amplification.

The effect of these things is a few factors' difference between different real cases, and an order of magnitude between extreme cases. Say, TLC is worse when you process terabyte datasets per day, yet awesome value for money (because more space) if you use it as a write-rarely, read-heavy cache.


Erase time also differs: fastest for SLC and slowest for TLC.


For lifetime in terms of, well, time, you need some assumptions.

Let's make a moderately optimistic calculation:

  • you have 10000 disk-erases (you chose MLC, not the cheapest, not the fanciest)
  • you write 10GB/day (the order of magnitude for moderate everyday computer use)
  • ...of large blocks of data, so no write amplification (10GB written is approximately 10GB erased)
  • perfectly spread wear (e.g. a scratch disk that gets cleaned daily)

Ten thousand erase cycles doesn't sound like a lot, but keep in mind it's relative to the size:

Writing 10GB to a 10GB drive means one disk-erase every day; 10k erases at one per day is 27 years.
Writing 10GB to a 100GB drive means (best case) one disk-erase every ten days; 10k erases at one per ten days is 270 years.
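
The same arithmetic as a sketch (it just restates the assumptions above; the function name and defaults are mine, not from any particular tool):

  # Sketch of the lifetime arithmetic above (ideal case: even wear).
  def lifetime_years(capacity_gb, erase_cycles, written_gb_per_day,
                     write_amplification=1.0):
      erased_gb_per_day = written_gb_per_day * write_amplification
      full_drive_erases_per_day = erased_gb_per_day / capacity_gb
      return erase_cycles / full_drive_erases_per_day / 365.0

  print(lifetime_years(10, 10000, 10))    # ~27 years
  print(lifetime_years(100, 10000, 10))   # ~274 years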


Now let's try for a more realistic (but not pathological) case. Let's say:

  • a portion of your writes are small, e.g. writing a few lines into many logs.
    Even if they are not immediately flushed (which you do on some logs for debug reasons), the OS buffers tend to flush based on a timeout, so writes stay small.
    If they really go to different blocks, you will need to erase-and-rewrite a block to write data a lot smaller than that block.
    Let's express that as a higher average write amplification.
  • you have uneven wear,
    e.g. half your drive is your music. That half is never written to, so does not wear, but it is occupied, so the other half wears twice as fast.
    Let's express that as twice the write amplification.
  • you write more data than average, for whatever reason

Let's say we wear the thing six times as fast.

That's still 4.5 and 45 years, respectively.
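
Continuing the earlier sketch (reusing the hypothetical lifetime_years function from above), with those effects rolled into one rough factor of six:

  # Same sketch, with the above effects folded into one rough factor.
  print(lifetime_years(10, 10000, 10, write_amplification=6))    # ~4.5 years
  print(lifetime_years(100, 10000, 10, write_amplification=6))   # ~45 years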


DWPD (Drive Writes Per Day) was introduced as a more intuitive (in theory) way of thinking about this. One DWPD means you write the entire drive's worth of data per day, which roughly means 1 erase per day, so e.g. 3 DWPD with a 4-year warranty was probably based on ~4000 erases, i.e. TLC. (verify)


Some manufacturers use TBW, Total Bytes Written, which entangles writes, erases, and an assumed write amplification factor (typically unmentioned, in which case you can assume it is an optimistic everyday-use one - which is why this is a fairly workload-specific metric, and considered dubious for general use).

For example, a TBW of 64TB on a 120GB drive, with an assumed amplification of 4, means approximately 2000 erase cycles.(verify)
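
As a sketch of how those ratings map back to approximate erase cycles (it just restates the arithmetic of the two examples above; the amplification factor is whatever the manufacturer assumed):

  # Sketch: converting endurance ratings back to rough erase-cycle counts.
  def erases_from_dwpd(dwpd, warranty_years):
      # one full-drive write per day is roughly one erase per day
      return dwpd * 365 * warranty_years

  def erases_from_tbw(tbw_tb, capacity_gb, write_amplification):
      full_drive_writes = tbw_tb * 1000.0 / capacity_gb
      return full_drive_writes * write_amplification

  print(erases_from_dwpd(3, 4))        # ~4400, roughly TLC-class
  print(erases_from_tbw(64, 120, 4))   # ~2100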



Cautious footnotes:

  • Real numbers vary. This is order-of-magnitude stuff.
    A few figures can easily vary by a factor of three or four or so, for various reasons.
  • Most of the above refers to more casual use. Intense use wears drives much faster.
    A nearer-to-pathological case is constant writing, e.g. serious bulk number crunching that works in many "take 100GB, process it, write 100GB" steps.
    That can be terabytes per day and make an SSD wear faster than any platter drive, and you're better off with some platter RAID. Or at least SLC. Datacenters tend to go for SLC.
  • Backup/archiving onto SSDs is roughly one erase per backup, which should make such a drive last long.
  • Keep in mind that things besides the actual storage can break, say, other involved electronics.
    Numbers above a dozen years are less meaningful for this reason.

How do SSDs fail

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Reasons behind failure:

  • Plain old component failure - seems to be a decent portion of real-world failures.
    Little you can do about this; the drive is probably dead, with hopes only at data recovery labs, though if the flash chips are unaffected they could in theory be transplanted for recovery.
  • Unrecoverable bit errors - basically, every so often a bit can get stored wrong.
    Because MLC and TLC use more voltage levels in the same cell, they have less leeway and are more likely to have bit problems when a cell degrades.
    This is considered in the design, in that they use more actual space for ECC (error correction) than fewer-level-per-cell variants, yet it's a little more complex than just that.
  • Unreliable power, e.g. a shifty power supply or a power surge, has been known to corrupt data.
  • Firmware bugs.
    Happened on some early models, though bugs still pop up.
  • Block erase failure - the predictable one.
    After a while, a block cannot be erased/programmed. The firmware detects this, reallocates the data to a spare block, and marks the old one as dead.
    This makes the real question "what happens after all spare blocks are gone?", which is up to the firmware (and its interaction with the controller).


It's long been generally said that some SSDs are specifically designed to become read-only and not lose data (rather than go on and eventually fail more strangely).

There are two aspects to this:

  • whether they decide to stop working before they start going wonky
  • what 'stop working' means


tl;dr: this varies. Some fail gracefully, some stay readable only until the next power cycle, some brick completely.

The last makes some sense in old-style RAID (failing hard and early gives clearer information to the controller), but is very bad for personal SSDs.

So "You'll always be able to read from it" is a type of optimism that isn't always justified.

Others do go read-only. (Note that this does not mean the filesystem is in a fully consistent state, but it is usually recoverable enough once copied off.)




Power

SSDs tend to use between 0.5 and 1 Watt when idle, and between 1 and 4 Watts when active.

Which is about the same as a 2.5" hard drive, so don't expect SSDs to make your laptop battery last longer.

There are even some cases where SSDs are worse. A decent chunk of power use on HDDs is moving heads about, so sequential operations are not their worst case; SSDs don't have such a difference.

(Also, power per gigabyte is typically worse, at least for the smaller SSDs)


SSD design has not focused much on power saving yet, so things may change.


See also

Most of my information came from Anand.

Including: