Computer data storage - Failure, error, and how to deal

Computer data storage
These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.





tl;dr

TODO.


Things to read

  • E Pinheiro, W-D Weber, L A Barroso, Failure Trends in a Large Disk Drive Population [1]
"the google one"
  • B Schroeder, G A Gibson, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?[2]
"the CMU one"


Failure versus error

Disk failure refers to the drive dropping out in a "now you see it, now you don't" sort of way, and relates to the expected lifetime.

Disk error refers to localized inability to correctly recall data (and possibly to retain it), and typically happens considerably before the drive fails as a whole (though RAID and OSes can decide that a lot of errors means we should consider it failed).


This difference matters somewhat in general, and more so to larger-scale storage.


Failure is easier to discuss than errors. You need not care why a failure happens. A drive dropping out as a whole is something you either did not plan for (bye, data) or did plan for, and you can model the probability of individual failure, the replacement interval, the likelihood of still losing data at a given point in time, and such.


RAID was made specifically to be robust in the face of random drive failure (random in the sense of "timing not particularly predictable").

RAID did not focus on error. RAID implementations tend to deal with the easiest cases of errors, but not actually with the bulk of real-world cases (varies with implementation, and with how strong a guarantee you want).

On errors

Terminology

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Note that there are some similar and alternative terms with subtly different meanings, so mixing them can get confusing fast. I try not to :)


  • consumer class disks
most space for money is here
there is more quality variation in desktop-class than in enterprise-class drives.
This sounds obvious enough, but it is one thing that keeps many arguments alive (and somewhat circular)
(the difference between good desktop drives and enterprise disks seems to be lessening recently)
desktop class
often a direct synonym for consumer-class.
...sometimes used in contrast with consumer-class NAS disks. In this case, desktop disks may be assumed to be powered off ~75% of the time (because that lets them claim a longer (wallclock) lifetime than if they were run 24/7)
  • enterprise class disks
better (lower) AFR / ARR than cheap-to-decent consumer class -- but not as much better as you may think. Studies vary. (Vague-but-conservative estimate: 30% longer lifetime.) (Certainly better than the cheapest consumer class. But then, so are good consumer drives. The biggest difference between good consumer and enterprise is how easy it is to assume the ARR is decent.)
lower error rate (a real effect, but apparently not by as much as the specs suggest)
Often twice the price of similar consumer models. In some cases this is worth it.
the idea is they have fewer issues during their lifetime. Compared to good consumer class disks they are overkill for many, but not all purposes. (and there are ways to deal with bad disks that, when you have large storage needs, are significantly cheaper than buying all enterprise disks)
It's easier to avoid iffy models by not buying consumer drives. Which you can read as:
"quality is more consistently high in enterprise drives, whereas good consumer models you have to know about"
"you pay for your own laziness" :)


  • ECC - the blanketing Error Correction (...Code)
in the context of disks, this often refers to the on-platter coding that lets the drive fix fixable errors and signal UREs (and uses ~2% of medium space)
in the context of storage servers, it also regularly refers to the use of ECC RAM
  • transient read errors
mistakes ECC can both detect and fix. You don't see these, the OS doesn't see these (though you can read out how often they have happened)
  • UNC, Uncorrectable Data, (an ATA term?(verify)). This page also uses URE, Unrecoverable Read Error, as a synonym
detected read error: "yeah, sorry, this data from the surface doesn't check out"
more technically: mistakes ECC can detect but not fix
Latent, detected on read
The drive signals this. The layer above that (OS, or RAID) has to choose how to deal.
This is the error rate often specced in datasheets, typically "one bit in 1E14 bits" (or 1E15 or 1E16)
cause may be transient, but more commonly physical problems
  • UDE, Undetected Disk Errors
undetected read errors: "here's some data as per usual (that is incorrect, but I didn't even know that)"
undetected write errors (which cause detected and undetected read errors)
Latent, not necessarily detected on read (see details)
There is no spec-sheet rate for UDEs (and presumably no strong relation to UREs)
Some UDEs cause UREs, so UDEs are not fully distinct from UREs
  • Bad blocks/sectors
disk surface unable to retain data properly
Contrast with UREs: this is one of a few different reasons for an URE
Contrast with UDEs: UDEs are typically caused by head-tracking errors or firmware errors or other things that can corrupt a sector and lead to an URE, without the sector actually being bad
Apparently disks typically have a few bad blocks when brand new, which are considered production flaws, and not reported as reallocated sectors by SMART(verify).
  • LSE - the CMU study mentions Latent Sector Errors (LSE), which they basically define as "a bad sector or UNC" (that we won't notice until the next read)


  • MTBF - Mean Time Between Failure
"the predicted elapsed time between inherent failures of a system during operation" (quoth wikipedia)
if you have many drives, MTBF and a little math tells you the average interval you need to get off your chair to replace a drive
says relatively little about the expected lifetime of a single device, beyond its order of magnitude
real-world MTBF is probably at least a factor three lower than what is quoted, and regularly more - see discussion somewhere below (and all over the 'net)
  • AFR - Annual Failure Rate
Takes MTBF and turns it into "how likely is the drive to fail this year"
...and is only as accurate as MTBF is
...assumes MTBF is a constant figure over many years, which is flawed
Drive vendors say/imply 0.5 to 1% yearly on most drives -- but see ARR
  • ARR - Annual Replacement Rate
Typically based on some disk user's log of actual replacements
...so these are typically a specific datacenter's real-world data estimating AFR (still not ideal figures, e.g. in that not all replacements are because of failure)
ARR figures in studies:
Google: ~2% for the first two years, ~7% up to year five. (then high)
CMU: values between 1% and 5% in ~4 years(verify) (sometimes higher)
can vary significantly between brands, series, and even models in the same series.



There is a useful division between

  • A hard error/failure: refers to a sector failing because it sits on physically unreliable surface.
basically meaning a bad sector
causes can include
the head having scratched the surface due to it being hit with enough force during operation
manufacturing defects (...that are not detected before sale because they just passed the test, and we cannot tell how barely)
heat
  • A soft error/failure: refers to things failing for reasons not clearly physical
May be transient, e.g. due to head positioning errors, or when the drive does not get enough power(verify)
causes can include
a weak write (possibly after some electromagnetic decay)
a write to the adjacent track that was off-track

Unrecoverable Read Errors (UREs)

Cases where a sector's data cannot be read back for any reason - it is verifiably wrong and cannot be corrected by the error correcting process.

(ATA calls it UNC, for uncorrectable)


UREs are latent in that the disk finds this out when it tries to read data, rather than when it was written (there is no double-checking read done at write time).

Yes, you can uncover these latent errors by reading the disk from start to finish, e.g. via RAID scrubs or a SMART extended test, but this is something you have to explicitly do (in part because it takes a while and degrades performance while running).


Hard drive spec sheets mention an error rate, which is roughly how frequently you should expect an URE (...but this is a flawed model, so take it as an order of magnitude).
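
To get a feel for what a spec like "one bit in 1E14 bits" means at current disk sizes, here is a minimal sketch in Python. The 12TB size and the two rates are just example numbers, and it naively treats the spec as an average rate of independent bit errors, which real drives do not quite follow (the spec reads more like a worst-case bound than a measured average):

import math

def ure_odds(disk_bytes, bit_error_rate):
    # Chance of hitting at least one URE when reading the whole disk once,
    # naively assuming independent bit errors at the quoted rate.
    expected_errors = disk_bytes * 8 * bit_error_rate
    p_at_least_one = 1 - math.exp(-expected_errors)   # Poisson approximation
    return expected_errors, p_at_least_one

# example: a 12 TB drive at the common consumer spec of 1 per 1E14 bits,
# and at the 1E15 figure more often quoted for enterprise drives
for rate in (1e-14, 1e-15):
    expected, p = ure_odds(12e12, rate)
    print("rate %g: ~%.2f expected UREs, ~%.0f%% chance of at least one per full read"
          % (rate, expected, p * 100))

This is the arithmetic behind a lot of "RAID5 is dead" articles; how closely real drives follow it is debatable, but it is why large arrays have to care about errors and not just failures.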


Their nature
How often

Undetected Disk Errors (UDEs)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

On failure

Failure generally refers to anything between:

  • catastrophic failure, as in a drive suddenly not being usable at all (hardware error, communication error, spinup problem, headcrash, etc.)
Not actually that common, as most disks are treated moderately well (and are of decent quality)
  • the decision that "okay, it's starting to be a bad idea to continue using this disk".
For one of many possible reasons. Consider:
SMART has a pass/fail thing, based on vendor-defined thresholds. Usually not acted on, something/someone needs to read this out.
OSes may do nothing, follow SMART, and sometimes add their own logic, and may actively warn you
RAID has similar and further reasons to fail a drive out of an array


On MTBF

Intro

MTBF can fundamentally never say that much about the lifetime of a single product, because of statistics: we have a mean but an unknown distribution (the best we get is an order of magnitude).


MTBF is useful to help calculate, basically, how often you'd physically replace a disk when you have a lot of them running. (That is, the average number of such replacements over a couple of years should be about as accurate as the MTBF figure itself.)


Theoretically -- because disk vendors seem to use the bendiest of logic here.

This is not about lab conditions versus real world. This is about why most consumer disks are quoted to last 50+ years and enterprise 100+ years. That will never ever happen, not even in lab conditions (and the calculations are sort of funny).

Regardless of why, and how much, the result is that these MTBFs are useless for planning because relying on them means things will always fail earlier and cost more than you calculated.

Based on various real-world statistics (see AFR), these MTBFs are typically optimistic by at least a factor 4 and in some cases closer to factor 10.

Quality varies a bunch, so sometimes this is pessimistic. However, to avoid surprises and higher costs, you may want to model on figures like 30K hours to be careful, up to at most 100K hours for known-to-be-high-quality drives.
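
As a back-of-the-envelope planning aid, here is a minimal sketch (in Python) of what an assumed, derated MTBF means in drive swaps per year for a fleet -- the fleet size and MTBF figures below are just example numbers:

def replacements_per_year(n_drives, mtbf_hours, hours_per_year=8760):
    # Expected number of drive swaps per year for a fleet running 24/7,
    # treating failures as independent and MTBF as constant over the period.
    return n_drives * hours_per_year / mtbf_hours

# example: 200 drives, modeled at the careful 30K-hour figure and at 100K hours
for mtbf in (30000, 100000):
    print("MTBF %d h: ~%.0f replacements/year" % (mtbf, replacements_per_year(200, mtbf)))

Running the same 200 drives against a spec-sheet 1,000,000-hour MTBF gives fewer than two replacements a year, which is exactly the sort of underestimate the above warns about.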


Some real-world reference


So what's with these high MTBFs?

A note on AFR

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Annualized Failure Rate (AFR) takes MTBF and approximates the chance of failure within a year of continuous use:

AFR = 1-exp(-8760/MTBF)

Assumptions:

  • it is wearing all the time (8760 hours per year) - fair enough
  • an exponential failure curve - also fair enough
  • AFR is constant with age (when you assume that MTBF is)
though good enough for the first two years, after that known to be incorrect


The last makes AFR as semi-useful-at-best as MTBF.

Real-world data, such as Google's study (with a large population), shows AFR under 2% in the first two years and ~8% by year three.


AFR makes sense on a spec sheet for desktops and laptops as a note alongside MTBF.

Keep in mind that if the spec sheet mentions a Duty Cycle that is not 100%, or a Power On Hours (POH) figure that is not 8760 (hours/year), then the spec assumes the drive is powered off part of the time.

If you do keep it on all the time, you can expect the real ARR to be higher. Not necessarily by the factor difference in hours, but certainly higher.
Conversely, taking a drive expected to be on all the time and powering it off most of the time may well make it live longer (until you hit another reason for failure)

I've seen Power On Hours figures like 2400 (~nine-to-five for 300 days, i.e. off ~70% of the time) and around 4000 (off ~half the time).


For example,

  • a 500000-hour MTBF used 100% of the time means an AFR of ~1.7%
  • a 50000-hour MTBF used 100% of the time means an AFR of ~16%
  • a 500000-hour MTBF used 28% of the time means an AFR of ~0.5%
  • a 50000-hour MTBF used 28% of the time means an AFR of ~4.8%
  • a 500000-hour MTBF with a "Power On Hours: 2400" spec but used all of the time means an AFR of ~6%
  • a 50000-hour MTBF with a "Power On Hours: 2400" spec but used all of the time means an AFR of ~47%
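
These come out of the AFR formula above, plus one extra assumption for the last two rows: that using a drive for more hours per year than its spec'd Power On Hours derates the MTBF proportionally (a rule of thumb used on this page, not something from a spec sheet). A small Python sketch of the arithmetic:

import math

HOURS_PER_YEAR = 8760

def afr(mtbf_hours, power_on_hours=HOURS_PER_YEAR, spec_poh=HOURS_PER_YEAR):
    # AFR = 1 - exp(-hours of use / MTBF), with the MTBF derated proportionally
    # when the drive is used more hours per year than the spec'd Power On Hours.
    effective_mtbf = mtbf_hours * min(1.0, spec_poh / power_on_hours)
    return 1 - math.exp(-power_on_hours / effective_mtbf)

print("%.1f%%" % (100 * afr(500000)))                              # ~1.7%
print("%.1f%%" % (100 * afr(50000)))                               # ~16%
print("%.1f%%" % (100 * afr(500000, power_on_hours=0.28 * 8760)))  # ~0.5%
print("%.1f%%" % (100 * afr(50000, power_on_hours=0.28 * 8760)))   # ~4.8%
print("%.1f%%" % (100 * afr(500000, spec_poh=2400)))               # ~6.2%
print("%.1f%%" % (100 * afr(50000, spec_poh=2400)))                # ~47%

Going the other way, a measured ARR of ~5% per year corresponds to an effective MTBF of roughly 8760/0.05 ≈ 175K hours -- nowhere near the million-hour spec figures.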

This is one reason you should be careful when you use desktop class for real data storage


Annual Replacement Rate (ARR) is a figure you can keep track of as a business or datacenter: the actual rate of replacements per year.

When said business/datacenter registers all replacements, this is the best real-world estimate of AFR we have, and can also be a basis for estimating a real-world MTBF.


A few notes from the data integrity angle

RAID

UREs and RAID
On scrubbing

tl;dr:

  • do it.
  • Doing it more often means latent errors are less likely to sit around and surprise you as read errors later.
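
For reference, on Linux software RAID (md) a scrub is a "check" request written to the array's sync_action file, and ZFS has its own equivalent in zpool scrub. A minimal Python sketch, assuming an md array at /dev/md0 and root privileges:

from pathlib import Path

# Trigger a consistency check ("scrub") on a Linux md array via sysfs.
# Distros often ship a cron job (e.g. Debian's checkarray) that does much the
# same thing on a monthly schedule.
Path("/sys/block/md0/md/sync_action").write_text("check")

# progress can be followed in /proc/mdstat
print(Path("/proc/mdstat").read_text())
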
UDEs and RAID
Some calculations

"RAID is not backup" and "Plan for failure"

On ZFS and the likes

Some newer filesystems are built with the assumption that drives will fail, and will produce errors before they do. They do not trust the disks to always be right, so they add their own checks everywhere.


If they additionally have redundancy (in ZFS, mirrors or RAID-Z), they can detect which copy/parity is good, correct the data, and verify that the correction is good.
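
To make the end-to-end checksum idea concrete, a toy Python sketch of "keep a checksum with every block, and on a bad read try the other copy" -- nothing like ZFS's actual on-disk logic, just the shape of it:

import hashlib

def store(block):
    # keep the data together with a checksum of it
    return {"data": block, "checksum": hashlib.sha256(block).digest()}

def read_verified(copies):
    # Read redundant copies: return the first whose data still matches its
    # checksum, so a silently corrupted copy is skipped (and could then be
    # rewritten from the good one).
    for copy in copies:
        if hashlib.sha256(copy["data"]).digest() == copy["checksum"]:
            return copy["data"]
    raise IOError("all copies failed their checksum")

# example: mirror a block, silently corrupt one copy, read back the good one
good = store(b"important data")
bad = dict(good, data=b"important dat\x00")   # bit rot on one mirror half
assert read_verified([bad, good]) == b"important data"

The point compared to plain RAID1 is that the checksum says which half of the mirror to believe, rather than only that the two halves disagree.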


In contrast:

  • little classical RAID can do verifiable repair (fundamentally so: mirror copies or single parity can tell you something is inconsistent, but not which part is wrong)
  • on everyday reads, classical RAID often hands you data as-is without checking for stripe consistency (inconsistency would only be noticed at next scrub/rebuild). This is often a design choice, sometimes an optional feature.
  • All this comes at the cost of some speed (being software RAID)
(But it's a tweakable tradeoff for good guarantees, and ZFS is better tuned for the checks than most RAID can be without a bunch of redesign)

See also ZFS notes

On MTTDL

Unsorted

SCSI sense

http://blog.disksurvey.org/knowledge-base/scsi-sense/