Computer data storage


✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Price

Pricing of SSDs is for a large part the wafer manufacturing, which means that die shrinks will cut price (but moving to a new manufacturing process takes time and money, both of which mean that the effect on the prices consumers see is only gradual)

...so any other trick to store more on the same silicon is interesting. This is where the following section comes in:

SLC, MLC, TLC

You'll often see the terms SLC and MLC. More specifically:

SLC (single-level cell): 1 bit per cell
MLC (multi-level cell): 2 bits per cell
TLC (triple-level cell): 3 bits per cell (sometimes '3-bit MLC')
QLC (quad-level cell): 4 bits per cell (sometimes '4-bit MLC')

This is not actually a physical difference in the cells, it's about the way they are read out and written by the controller.

The obvious upside to more bits per cell is more storage on the same silicon, which is therefore cheaper per GB.

The downside is that storing reliably at a higher density takes more care, so more latency (and less throughput when using the same number of flash chips, but that's rarely a problem because there are typically enough chips side by side). The wear is also different.

Lifetime

tl;dr:

For light to moderate use, an SSD should last no shorter than the average platter drive

...assuming SLC or MLC;

when considering TLC and particularly QLC, skip the math at your data's peril.

pathological cases will wear them faster than platter drives. Mostly heavy writes.

they don't necessarily have the properties for long term archiving. Some do, some don't.

The lifetime of flash is determined primarily by how many times you can erase-and-write each of its blocks

Order-of-magnitude numbers (there is significant variation within each type, manufacturing requirements, products, etc.)

SLC should last order of 100000 erases (verify) (which seems to mean "it'll probably fail in another way first")

for write-heavy workloads, these can work out as cheaper over time.

This and their lower latency makes them interesting for servers, e.g. database needs.

MLC should last at least 10000, though there are variants rated at 5000 or 3000 or so (verify)

(eMLC, enterprise MLC, a few multiples more),

generally a nice tradeoff for moderate use on regular PCs, in life for price

TLC should last at least 3000, though there are variants that are rated lower

QLC has variously been quoted as 100, or sometimes maybe 1000

TLC and particularly QLC give us cheap SSDs, but their tradeoff is only worth it for read-heavy workloads, and the price-per-GB improvement for these seems less than the bit density suggests.

So it's vaguely like: "Low cost, good performance, high capacity, good lifetime: choose three"

...but not quite, because define performance. You can get a lot of throughput from TLC and QLC, even if it's not the lowest latency

How evenly you spend each block's erases depends on wear leveling.

How fast you spend these erases is determined

largely by how much you write to it

but also by what kind of access patterns you have (e.g. small/random writes, how often you sync(), and all other reasons for write amplification).

Also note that a larger SSD of the same type of flash means the same amount of erases are spread to more flash blocks, lowering the rate at which each block gets erased.

...which is still relatively abstract. If you want some ballparks for lifetime in terms of, well, time, you need some assumptions.

Let's make some fairly optimistic assumptions:

you have 10000 disk-erases (you chose MLC, not the cheapest, not the fanciest)

for TLC, divide by 3

you write 10GB/day (the order of magnitude for moderate everyday computer use)
...of large blocks of data, so no write amplification (so 10GB writes is approx. 10GB of erase)
perfectly spread wear (e.g. a scratch disk that gets cleared daily)

Ten thousand erase cycles doesn't sound like a lot, but keep in mind it's relative to the size:

Writing 10GB to a 10GB drive means one disk-erase every day; 10k erases at one per day is 27 years,

writing 10GB to a 100GB drive means one disk-erase every ten days; 10k erases at one per ten days is 270 years

writing 10GB to a 1TB drive means ~three overall erases per year, and so on.

(numbers above a dozen years should probably be read as "good enough", and not as meaningful numbers, because chances are good something else will fail first)
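The arithmetic above is easy to redo for your own numbers. A minimal sketch, under the same optimistic assumptions (perfectly even wear, erases counted in whole-drive units); the function name and the write_amplification parameter are made up for illustration and not from any tool:

```python
def lifetime_years(capacity_gb, writes_gb_per_day, erase_cycles, write_amplification=1.0):
    """Rough years until the rated erase cycles are used up, assuming perfectly even wear."""
    # Fraction of the whole drive that gets erased-and-rewritten per day.
    drive_erases_per_day = writes_gb_per_day * write_amplification / capacity_gb
    days = erase_cycles / drive_erases_per_day
    return days / 365.0


# 10000 erases (the MLC-ish assumption), 10GB/day, on three drive sizes.
for capacity_gb in (10, 100, 1000):
    years = lifetime_years(capacity_gb, writes_gb_per_day=10, erase_cycles=10_000)
    print(f"{capacity_gb:>5} GB drive: ~{years:.0f} years")
```

Leaving write_amplification at 1.0 reproduces the optimistic case; raising it shrinks the result by that factor.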

DWPD, TBW

Maybe-easier ways of thinking about this?

DWPD (Drive Writes Per Day) was introduced as a more direct, labeled way to think about this.

One DWPD means you write the entire size's worth per day.

For example, if a drive mentions it's good for 3 DWPD, and has a 4 year warranty, that works out as a few thousand erases (assuming that writes are 1:1 with erases), and you can guess it's probably TLC.

TBW, Total Bytes Written, entangles writes, erases, and an assumed write amplification factor (typically unmentioned, typically optimistic, so potentially very sneaky) in a single figure.

Say, 2000 erases on a 120GB drive means you can write maybe 240TB to it. Assuming an amplification factor of 4, it'll probably stop working after 60TB.

That drive may be advertised as 60TBW (the implied unit is TB, should arguably be 60TB TBW).

Or, if the marketing department are asses, anything between 60TBW and 240TBW without any mention of amplification factor

That said, you usually only get the TBW, and still have to do the math yourself.
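That math, as a minimal sketch. The function names are made up for illustration, it uses decimal units (1TB = 1000GB), and the amplification factor is an assumption you have to pick yourself, since the spec sheet usually won't tell you:

```python
def raw_write_budget_tb(capacity_gb, erase_cycles):
    """TB the flash itself can absorb: capacity times rated erase cycles."""
    return capacity_gb * erase_cycles / 1000.0


def host_tbw(capacity_gb, erase_cycles, write_amplification):
    """TB the host can write once an assumed amplification factor is taken out."""
    return raw_write_budget_tb(capacity_gb, erase_cycles) / write_amplification


def implied_erase_cycles(dwpd, warranty_years, write_amplification=1.0):
    """Erase cycles implied by a DWPD rating held for the whole warranty period."""
    return dwpd * 365 * warranty_years * write_amplification


print(raw_write_budget_tb(120, 2000))   # flash-level budget for the 120GB example
print(host_tbw(120, 2000, 4))           # the same drive with a factor of 4 assumed
print(implied_erase_cycles(3, 4))       # 3 DWPD over a 4 year warranty
```

Going the other way (from an advertised TBW back to erase cycles) needs the same guess about amplification, which is exactly the part you are not told.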

I like and dislike this metric for the same reason: it leaves too much open to you figuring out how you may use it.

Say,

the Samsung 870 EVO specs use a factor 600 between size and TBW,

which feels like 2000-erase TLC with an amplification factor of 4-ish, but who knows?

I'm looking at an 8TB 870 QVO with a 2880 TB TBW, which is a factor of only 360

(so if you use it to constantly write, at its maximum speed it will not last even half a year - so it's a poor scratch disk)
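A quick way to check both the ratio and the "how long at full tilt" claim. This is a sketch with made-up function names, and the 500 MB/s sustained write speed is purely an assumption (roughly what a SATA link allows); a real drive's sustained speed may well be lower, which would stretch the result out:

```python
def tbw_per_tb(tbw_tb, capacity_tb):
    """Rated TBW divided by capacity: how many times you can fill the drive."""
    return tbw_tb / capacity_tb


def days_to_exhaust(tbw_tb, sustained_mb_per_s):
    """Days of non-stop writing at a given speed before the rated TBW is used up."""
    seconds = tbw_tb * 1_000_000 / sustained_mb_per_s   # 1 TB = 1,000,000 MB
    return seconds / 86_400


print(tbw_per_tb(2880, 8))                     # the 8TB QVO figure from above
print(round(days_to_exhaust(2880, 500)))       # assumed 500 MB/s, written constantly
```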

On unequal wear, and amplification factor

The previous bit sounds optimistic, and is. In reality you may find it wears factors faster.

Consider:

a portion of your writes are small

e.g. writing a few lines into multiple logs, as soon as possible (modern loggers may choose to flush more slowly, specifically to preserve SSDs)

OS buffers tend to flush based on timeout, so you may get a minimum rate of writes

but for debug around kernel panics and such you may want to flush faster

various writes go to two places (e.g. filesystem data and metadata are often in different blocks)

writes smaller than an SSD block still often mean a whole-block erase

if half of your storage never changes, you may wear the other half twice as fast

(static wear leveling reduces this, but only so much. Also, you may not know what any particular SSD is doing exactly)

people may write more data than that ~10GB/day casual-workstation-use figure

intense use can be much faster. Consider e.g. serious number crunching like "take 100GB, process it, write 100GB" steps. That can be terabytes per day. (For such uses you may be better off with platter RAID. Or at least SLC, which datacenters tend to favour.)

Backup/archiving onto SSDs is easily one erase per backup

which is actually not that bad if that's all you do with it

and may be much less if you can get it to work incrementally

some SSDs compress data (this can be a net-positive effect)

The combined effect is often expressed as amplification factor, with the meaning of 'it will wear this factor faster than you thought based just on size'.

If you like your data, you should be pessimistic about amplification factor. Assume it's 2 to 4.

So if you like careful estimates, then instead of those 27 and 270 years mentioned earlier, it's a much more sober 7 and 70 years or so (at a factor of 4). Which isn't any shorter than platter.

On wear leveling

Raw flash storage has no wear leveling.

In a pathological case you will have one sector wear out hundreds of times faster, and once one single block (or even cell) becomes unusable, the device may become useless as a whole.

This is why we've learned not to trust memory cards and USB sticks for serious storage. That, and the fact you're probably buying the cheapest you can find.

SSDs, on the other hand, will typically do dynamic wear leveling, meaning that the drive chooses whatever free block it wishes, noting where it put it so that the logical addressing that the OS uses is always the same.

Roughly, this means all actively written areas wear equally quickly.

(On platter this would be slower (because head seeks) and is only done for retired sectors replaced with spare sectors, but on SSD you can do this for all data basically for free.)
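To make the idea concrete, here is a toy model of dynamic wear leveling. It is a sketch only: the class name, the "pick the least-erased free block" policy, and the erase-immediately behaviour are all assumptions for illustration, and real controllers work at page granularity with garbage collection and are far more involved.

```python
class ToyFTL:
    """Toy flash translation layer: logical block -> physical block, with erase counts."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks
        self.free = set(range(physical_blocks))
        self.mapping = {}  # logical block number -> physical block number

    def write(self, logical_block):
        # Pick the free physical block with the fewest erases so far.
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        old = self.mapping.get(logical_block)
        # Remember where the data went, so the logical address never changes.
        self.mapping[logical_block] = target
        if old is not None:
            # The previous copy is now stale: erase it and return it to the free pool.
            self.erase_counts[old] += 1
            self.free.add(old)
        return target


ftl = ToyFTL(physical_blocks=8)
for _ in range(800):
    ftl.write(logical_block=0)   # the OS keeps rewriting the same logical address
print(ftl.erase_counts)          # erases end up spread over all physical blocks
```

Without the remapping, all of those rewrites would have landed on one physical block.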

Now consider the case where half your data never changes.

In a simple implementation, those blocks would never get any erases, while the other half of the blocks would wear twice as fast. Or, when it's 90% full like most people's hard drives, would wear the rest 10 times as fast.

Dynamic wear leveling applies only to new writes, so has no effect on this.
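The "twice as fast" and "10 times as fast" figures follow from simple division. A minimal sketch, assuming the static data is never moved and all writes are spread evenly over the remaining blocks (the function name is made up for illustration):

```python
def wear_multiplier(static_fraction):
    """How much faster the non-static blocks wear when static data is never moved."""
    if not 0 <= static_fraction < 1:
        raise ValueError("static_fraction must be in [0, 1)")
    # All erases land on the (1 - static_fraction) share of blocks that still changes.
    return 1.0 / (1.0 - static_fraction)


for fraction in (0.0, 0.5, 0.9):
    print(f"{fraction:.0%} static: remaining blocks wear ~{wear_multiplier(fraction):.0f}x as fast")
```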

Static wear leveling means that the SSD will, in the background, move the more static data around, freeing up the less-written blocks for more active use.

Note that this isn't free -- it makes things a little slower because it may at any time be moving a bulk of data around. More significantly, doing this at all is a lot of writes and therefore erases. It's a statistical game, but net-positive if the difference in block erase counts grows large enough.

Notes

neither type of wear leveling removes write amplification

it just reduces the amount of negative effect

some recent filesystems are designed to do wear leveling on top of storage that doesn't

dynamic and static wear leveling may be implemented not over the entire SSD, but within specific parts (verify)

Presumably there are no-name brands with poor wear leveling.

This is one reason you sadly have to care about brands. This should work because larger names don't want to become known as the ones that are significantly worse than the rest.

Much of this can be unknown to consumers, though (regardless of brand)

Say, did you know this is why SD cards are terrible and complete unknowns?

Do you know the failure mode of any hardware?

On speed, and the reasons it varies significantly between uses

Compared to a decent HDD:

random write: For cheaper drives, this is on the order of HDD speeds, sometimes worse. On good drives it's slightly better.
random read: faster
sequential read: faster
sequential write: similar
speed over time: degrades as well (but for different reasons; see below)
application launch: faster, because this is mostly (and relatively random) reads
application tests: slightly faster, but often hard to tell because there's little IO most of the time. Running many things at the same time will be faster in that IO won't degrade if it's mostly reads

Some ballpark figures, for MLC

Random write
- Less than 1 MB/s on some cheap MLC drives, down to .05 or so in the worst cases
- 1..15 MB/s on moderate MLC
- up to 20 or 30 or 40 on good MLC drives, and on SLC
- HDDs: ~1MB/s.
Random read
- 15..50 MB/s (latency often .2 to .7 ms)
- HDDs: <1 MB/s
Sequential write
- 70..180 MB/s (some down to 45)
- HDDs: ~100 MB/s
Sequential read
- 100..250 MB/s
- HDDs: ~100 MB/s

Maximum sequential throughput benchmarks are stupid. There are few real-world use patterns that are sequential and fast enough to use all that speed, and do something useful in the process. Yes, some large-scale well-optimized number crunching. Little server use. Not your desktop.

Also, how drives manage their data affects what it is tuned for. Those with high throughput figures may actually do worse for your real, random workloads - and reviews that report only throughput are indirectly asking for that, because we buy these things based on something we can easily grasp.

Most everyday OS and application work is random reads and random writes. (In somewhat consistent and somewhat predictable areas, but random for most real purposes)

SSD random read latency is simple: It's better than HDD's. It's a major reason applications launch faster, why applications running in parallel often bother each other less (depending on the application), and why load time in games tends to be lower (that one can also depend a lot on game-specific tuning for HDDs).

SSD random write latency varies. It depends on a lot of lower-level details, how clever the OS is, and what applications actually do. In bad cases it can be worse than HDD's. Since random writes are fairly common in many workloads, random write performance is one of the most significant performance metrics of an SSD, unless you are buying it for a specific use where you know it isn't (e.g. a cache drive). (There were some early models, and some current cheaper ones, which had very high random write latency, and as a result, fairly miserable random write throughput that any HDD could beat)

More details

A page is a smallish set of cells, the smallest unit that can be written, and is currently often 4KB (larger on a few models).

Pages are grouped in blocks. A block is the smallest unit that can be erased, often 64 or 128 pages, so currently on the order of 256KB or 512KB or so.
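The sizes above, and why small writes can hurt, as a minimal sketch. The function names are made up, and the worst case shown (a small update forcing a whole block to be rewritten) is an assumption for illustration; real controllers remap pages and batch writes, so they usually do a lot better than this:

```python
PAGE_KB = 4  # the common page size mentioned above


def block_size_kb(pages_per_block, page_kb=PAGE_KB):
    """Size of the smallest erasable unit."""
    return pages_per_block * page_kb


def worst_case_amplification(write_kb, pages_per_block, page_kb=PAGE_KB):
    """Flash KB rewritten per host KB if a small update forces a whole-block rewrite."""
    return block_size_kb(pages_per_block, page_kb) / write_kb


for pages in (64, 128):
    print(f"{pages} pages/block: {block_size_kb(pages)} KB blocks, "
          f"a lone 4KB write could cost up to {worst_case_amplification(4, pages):.0f}x")
```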

Pages, blocks, and write amplification

Spare blocks, overprovisioning

Spare blocks, wear, and speed degradation

TODO: figure out in enough detail to actually write about it

On TRIM

How do SSDs fail


Reasons behind failure:

plain old component failure -

seems to be a decent portion of real-world failures

Note that if the flash chips are unaffected, a recovery lab may be able to transplant them for recovery.

like any component failure. Assume it may be dead, so never have just one copy. Like any storage.

Unrecoverable bit errors - basically, every so often a bit can get stored wrong.

Because MLC and TLC use more voltage levels in the same cell, they have less leeway and are more likely to have bit problems when a cell degrades.

This is considered in MLC/TLC design, in that they use more cells for ECC (error correction) than fewer-level-per-cell variants -- yet it's a little more complex than just that.

unreliable power, e.g. an overloaded power supply or a power surge, has been known to corrupt data.

Firmware bugs.

Happened on some early models and much less so now

Block erase failure - the predictable one.

After a while, a block cannot be erased/programmed. The firmware detects this, reallocates the data to a spare block, and marks the old one as dead

This makes the real question "what happens after all spare blocks are gone?", which is up to the firmware (and its interaction with the controller).

Okay, but how do they fail

You may have heard that SSDs are designed to become read-only and not lose data.

That depends.

So it is a very bad thing to assume.

In reality,

some become read-only -- but not in reaction to failure, but based on "it's probably worn" statistics

some fail gracefully and are easily recovered,

some only stay readable in the same reboot,

some brick completely, never to show up in your device list ever again

Unless you know the specific model, and that it never changed with firmware versions, then you won't know for sure what yours does until it does it, so if you care about your data, you should absolutely never make optimistic assumptions.

Even if it does the intentional read-only thing, it's not guaranteed that it does so before something fails, but chances are at least better.

Many give decent early warning, via tooling. Sometimes also by becoming noticeably slower (...but you may not notice that in servers, or even necessarily workstations).

You should not assume that sector reallocation happens gradually. Sometimes it does, sometimes the drive is quickly overwhelmed.

Power

SSDs tend to use between 0.5 and 1 Watt when idle, and between 1 and 4 Watts when active.

Which is about the same as a 2.5" hard drive, so don't expect SSDs to make your laptop battery last longer.

There are even some cases where SSDs are worse. A decent chunk of power use on HDDs is moving heads about, so sequential operations are not the worst case. SSDs don't have such a difference.

As of this writing, SSDs have not focused too much on power saving yet, so things may change.

Computer data storage - SSD notes
