Computer data storage - Failure, error, and how to deal

Computer data storage
These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.





tl;dr

TODO.


Things to read

  • E Pinheiro, W-D Weber, L A Barroso, Failure Trends in a Large Disk Drive Population [1]
"the google one"
  • B Schroeder, G A Gibson, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?[2]
"the CMU one"


Failure versus error

Disk failure refers to the drive dropping out in a "now you see it, now you don't" sort of way, and relates to the expected lifetime.

Disk error refers to localized inability to correctly recall data (and possibly to retain it), and typically happens considerably before the drive fails as a whole (though RAID and OSes can decide that a lot of errors means we should consider it failed).


This difference matters somewhat in general, and more so to larger-scale storage.


Failure is easier to discuss than errors. You need not care why a failure happens. A drive dropping out as a whole is something you either did not plan for (bye, data) or did plan for, and you can model the probability of individual failure, the replacement interval, the likelihood of still losing data at a given point in time, and such.


RAID was made specifically to be robust in the face of random drive failure (random in the sense of "timing not particularly predictable").

RAID did not focus on error. RAID implementations tend to deal with the easiest cases of errors, but not actually with the bulk of real-world cases (varies with implementation, and with how strong a guarantee you want).

On errors

Terminology

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Note that there are some similar and alternative terms with subtly different meanings, so mixing them can get confusing fast. I try not to :)


  • consumer class disks
most space for money is here
there is more quality variation in desktop-class than in enterprise-class drives.
This sounds obvious enough, but it is one thing that keeps many arguments alive (and somewhat circular)
(the difference between good desktop drives and enterprise disks seems to be lessening recently)
desktop class
often a direct synonym for consumer-class.
...sometimes used in contrast with consumer-class NAS disks. In this case, desktop disks may be assumed to be powered off ~75% of the time (because that lets them claim a longer (wallclock) lifetime than if they were run 24/7)
  • enterprise class disks
better (lower) AFR / ARR than cheap-to-decent consumer class -- but not as much better as you may think. Studies vary. (Vague-but-conservative estimate: 30% longer lifetime.) (Certainly better than the cheapest consumer class. But then, so are good consumer drives. The biggest difference between good consumer and enterprise is how easy it is to assume the ARR is decent.)
lower error rate (a real effect, but apparently not by as much as the specs suggest)
Often twice the price of similar consumer models. In some cases this is worth it.
the idea is they have fewer issues during their lifetime. Compared to good consumer class disks they are overkill for many, but not all purposes. (and there are ways to deal with bad disks that, when you have large storage needs, are significantly cheaper than buying all enterprise disks)
It's easier to avoid iffy models by not buying consumer drives. Which you can read as:
"quality is more consistently high in enterprise drives, whereas good consumer models you have to know about"
"you pay for your own laziness" :)


  • ECC - the blanketing Error Correction (...Code)
in the context of disks, this often refers to the on-platter coding that lets the drive fix fixable errors and signal UREs (and uses ~2% of medium space)
in the context of storage servers, it also regularly refers to the use of ECC RAM
  • transient read errors
mistakes ECC can both detect and fix. You don't see these, the OS doesn't see these (though you can read out how often they have happened)
  • UNC, Uncorrectable Data, (an ATA term?(verify)). This page also uses URE, Unrecoverable Read Error, as a synonym
detected read error: "yeah, sorry, this data from the surface doesn't check out"
more technically: mistakes ECC can detect but not fix
Latent, detected on read
The drive signals this. The layer above that (OS, or RAID) has to choose how to deal.
This is the error rate often specced in datasheets, typically "one bit in 1E14 bits" (or 1E15 or 1E16)
cause may be transient, but more commonly physical problems
  • UDE, Undetected Disk Errors
undetected read errors: "here's some data as per usual (that is incorrect, but I didn't even know that)"
undetected write errors (which cause detected and undetected read errors)
Latent, not necessarily detected on read (see details)
There is no spec-sheet rate for UDEs (and presumably no strong relation to UREs)
Some UDEs cause UREs, so UDEs are not fully distinct from UREs
  • Bad blocks/sectors
disk surface unable to retain data properly
Contrast with UREs: this is one of a few different reasons for an URE
Contrast with UDEs: UDEs are typically caused by head-tracking errors or firmware errors or other things that can corrupt a sector and lead to an URE, without the sector actually being bad
Apparently disks typically have a few bad blocks when brand new, which are considered production flaws, and not reported as reallocated sectors by SMART(verify).
  • LSE - the CMU study mentions Latent Sector Errors (LSE), which they basically define as "a bad sector or UNC" (that we won't notice until the next read)


  • MTBF - Mean Time Between Failure
"the predicted elapsed time between inherent failures of a system during operation" (quoth wikipedia)
if you have many drives, MTBF and a little math tells you the average interval you need to get off your chair to replace a drive
says relatively little about the expected lifetime of a single device, beyond its order of magnitude
real-world MTBF is probably at least a factor three lower than what is quoted, and regularly more - see discussion somewhere below (and all over the 'net)
  • AFR - Annual Failure Rate
Takes MTBF and turns it into "how likely is the drive to fail this year"
...and is only as accurate as MTBF is
...assumes MTBF is a constant figure over many years, which is flawed
Drive vendors say/imply 0.5 to 1% yearly on most drives -- but see ARR
  • ARR - Annual Replacement Rate
Typically based on some disk user's log of actual replacements
...so these are typically a specific datacenter's real-world data estimating AFR (still not ideal figures, e.g. in that not all replacements are because of failure)
ARR figures in studies:
Google: ~2% for the first two years, ~7% up to year five. (then high)
CMU: values between 1% and 5% in ~4 years(verify) (sometimes higher)
can vary significantly between brands, series, and even models in the same series.



There is a useful division between

  • A hard error/failure: refers to a sector failing because it sits on physically unreliable surface.
basically meaning a bad sector
causes can include
the head having scratched the surface due to it being hit with enough force during operation
manufacturing defects (...that are not detected before sale because they just passed the test, and we cannot tell how barely)
heat
  • A soft error/failure: refers to things failing for reasons not clearly physical
May be transient, e.g. due to head positioning errors, or when the drive does not get enough power(verify)
causes can include
a weak write (possibly after some electromagnetic decay)
a write to the adjacent track that was off-track

Unrecoverable Read Errors (UREs)

Cases where a sector's data cannot be read back for any reason - it is verifiably wrong and cannot be corrected by the error correcting process.

(ATA calls it UNC, for uncorrectable)


UREs are latent in that the disk finds this out when it tries to read data, rather than when it was written (there is no double-checking read done at write time).

Yes, you can uncover these latent errors by reading the disk from start to finish, e.g. via RAID scrubs or a SMART extended test, but this is something you have to explicitly do (in part because it takes a while and degrades performance while running).


Hard drive spec sheets mention an error rate, which is roughly how frequently you should expect an URE (...but this is a flawed model, so take it as an order of magnitude).
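
To get a feel for what a spec like "one bit in 1E14 bits" means at current disk sizes, here is a minimal sketch in Python. The 12TB size and the two rates are just example numbers, and it naively treats the spec as an average rate of independent bit errors, which real drives do not quite follow (the spec reads more like a worst-case bound than a measured average):

import math

def ure_odds(disk_bytes, bit_error_rate):
    # Chance of hitting at least one URE when reading the whole disk once,
    # naively assuming independent bit errors at the quoted rate.
    expected_errors = disk_bytes * 8 * bit_error_rate
    p_at_least_one = 1 - math.exp(-expected_errors)   # Poisson approximation
    return expected_errors, p_at_least_one

# example: a 12 TB drive at the common consumer spec of 1 per 1E14 bits,
# and at the 1E15 figure more often quoted for enterprise drives
for rate in (1e-14, 1e-15):
    expected, p = ure_odds(12e12, rate)
    print("rate %g: ~%.2f expected UREs, ~%.0f%% chance of at least one per full read"
          % (rate, expected, p * 100))

This is the arithmetic behind a lot of "RAID5 is dead" articles; how closely real drives follow it is debatable, but it is why large arrays have to care about errors and not just failures.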


Their nature
How often

Undetected Disk Errors (UDEs)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

On failure

Failure generally refers to anything between:

  • catastrophic failure, as in a drive suddenly not being usable at all (hardware error, communication error, spinup problem, headcrash, etc.)
Not actually that common, as most disks are treated moderately well (and are of decent quality)
  • the decision that "okay, it's starting to be a bad idea to continue using this disk".
For one of many possible reasons. Consider:
SMART has a pass/fail thing, based on vendor-defined thresholds. Usually not acted on, something/someone needs to read this out.
OSes may do nothing, follow SMART, and sometimes add their own logic, and may actively warn you
RAID has similar and further reasons to fail a drive out of an array


On MTBF

Intro

MTBF can fundamentally never say that much about the lifetime of a single product, because of statistics: we have a mean but an unknown distribution (the best we get is an order of magnitude).


MTBF is useful to help calculate, basically, how often you'd physically replace a disk when you have a lot of them running. (That is, the average number of such replacements over a couple of years should be about as accurate as the MTBF figure itself.)


Theoretically -- because disk vendors seem to use the bendiest of logic here.

This is not about lab conditions versus real world. This is about why most consumer disks are quoted to last 50+ years and enterprise 100+ years. That will never ever happen, not even in lab conditions (and the calculations are sort of funny).

Regardless of why, and how much, the result is that these MTBFs are useless for planning because relying on them means things will always fail earlier and cost more than you calculated.

Based on various real-world statistics (see AFR), these MTBFs are typically optimistic by at least a factor 4 and in some cases closer to factor 10.

Quality varies a bunch, so sometimes this is pessimistic. However, to avoid surprises and higher costs, you may want to model on figures like 30K hours to be careful, up to at most 100K hours for known-to-be-high-quality drives.
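
As a back-of-the-envelope planning aid, here is a minimal sketch (in Python) of what an assumed, derated MTBF means in drive swaps per year for a fleet -- the fleet size and MTBF figures below are just example numbers:

def replacements_per_year(n_drives, mtbf_hours, hours_per_year=8760):
    # Expected number of drive swaps per year for a fleet running 24/7,
    # treating failures as independent and MTBF as constant over the period.
    return n_drives * hours_per_year / mtbf_hours

# example: 200 drives, modeled at the careful 30K-hour figure and at 100K hours
for mtbf in (30000, 100000):
    print("MTBF %d h: ~%.0f replacements/year" % (mtbf, replacements_per_year(200, mtbf)))

Running the same 200 drives against a spec-sheet 1,000,000-hour MTBF gives fewer than two replacements a year, which is exactly the sort of underestimate the above warns about.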


Some real-world reference


So what's with these high MTBFs?

A note on AFR

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Annualized Failure Rate (AFR) takes MTBF and approximates the chance of failure within a year of continuous use:

AFR = 1-exp(-8760/MTBF)

Assumptions:

  • it is wearing all the time (8760 hours per year) - fair enough
  • an exponential failure curve - also fair enough
  • AFR is constant with age (when you assume that MTBF is)
though good enough for the first two years, after that known to be incorrect


The last makes AFR as semi-useful-at-best as MTBF.

Real-world data, such as Google's study (with a large population), shows AFR under 2% in the first two years and ~8% by year three.


AFR makes sense on a spec sheet for desktops and laptops as a note alongside MTBF.

Keep in mind that if the spec sheet mentions a Duty Cycle that is not 100%, or a Power On Hours (POH) figure that is not 8760 (hours/year), then the spec assumes the drive is powered off part of the time.

If you do keep it on all the time, you can expect the real ARR to be higher. Not necessarily by the factor difference in hours, but certainly higher.
Conversely, taking a drive expected to be on all the time and powering it off most of the time may well make it live longer (until you hit another reason for failure)

I've seen Power On Hours figures like 2400 (~nine-to-five for 300 days, i.e. off ~70% of the time) and around 4000 (off ~half the time).


For example,

  • a 500000-hour MTBF used 100% of the time means an AFR of ~1.7%
  • a 50000-hour MTBF used 100% of the time means an AFR of ~16%
  • a 500000-hour MTBF used 28% of the time means an AFR of ~0.5%
  • a 50000-hour MTBF used 28% of the time means an AFR of ~4.8%
  • a 500000-hour MTBF with a "Power On Hours: 2400" spec but used all of the time means an AFR of ~6%
  • a 50000-hour MTBF with a "Power On Hours: 2400" spec but used all of the time means an AFR of ~47%
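
These come out of the AFR formula above, plus one extra assumption for the last two rows: that using a drive for more hours per year than its spec'd Power On Hours derates the MTBF proportionally (a rule of thumb used on this page, not something from a spec sheet). A small Python sketch of the arithmetic:

import math

HOURS_PER_YEAR = 8760

def afr(mtbf_hours, power_on_hours=HOURS_PER_YEAR, spec_poh=HOURS_PER_YEAR):
    # AFR = 1 - exp(-hours of use / MTBF), with the MTBF derated proportionally
    # when the drive is used more hours per year than the spec'd Power On Hours.
    effective_mtbf = mtbf_hours * min(1.0, spec_poh / power_on_hours)
    return 1 - math.exp(-power_on_hours / effective_mtbf)

print("%.1f%%" % (100 * afr(500000)))                              # ~1.7%
print("%.1f%%" % (100 * afr(50000)))                               # ~16%
print("%.1f%%" % (100 * afr(500000, power_on_hours=0.28 * 8760)))  # ~0.5%
print("%.1f%%" % (100 * afr(50000, power_on_hours=0.28 * 8760)))   # ~4.8%
print("%.1f%%" % (100 * afr(500000, spec_poh=2400)))               # ~6.2%
print("%.1f%%" % (100 * afr(50000, spec_poh=2400)))                # ~47%

Going the other way, a measured ARR of ~5% per year corresponds to an effective MTBF of roughly 8760/0.05 ≈ 175K hours -- nowhere near the million-hour spec figures.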

This is one reason you should be careful when you use desktop class for real data storage


Annual Replacement Rate (ARR) is a figure you can keep track of as a business or datacenter: the actual rate of replacements per year.

When said business/datacenter registers all replacements, this is the best real-world estimate of AFR we have, and can also be a basis for estimating a real-world MTBF.


A few notes from the data integrity angle

RAID

UREs and RAID
On scrubbing

tl;dr:

  • do it.
  • Doing it more often means latent errors are less likely to sit around and surprise you as read errors later.
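
For reference, on Linux software RAID (md) a scrub is a "check" request written to the array's sync_action file, and ZFS has its own equivalent in zpool scrub. A minimal Python sketch, assuming an md array at /dev/md0 and root privileges:

from pathlib import Path

# Trigger a consistency check ("scrub") on a Linux md array via sysfs.
# Distros often ship a cron job (e.g. Debian's checkarray) that does much the
# same thing on a monthly schedule.
Path("/sys/block/md0/md/sync_action").write_text("check")

# progress can be followed in /proc/mdstat
print(Path("/proc/mdstat").read_text())
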
UDEs and RAID
Some calculations

"RAID is not backup" and "Plan for failure"

On ZFS and the likes

Some newer filesystems are built with the assumption that drives will fail, and will produce errors before they do. They do not trust the disks to always be right, so they add their own checks everywhere.


If they additionally have redundancy (in ZFS, mirrors or RAID-Z), they can detect which copy/parity is good, correct the data, and verify that the correction is good.
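
To make the end-to-end checksum idea concrete, a toy Python sketch of "keep a checksum with every block, and on a bad read try the other copy" -- nothing like ZFS's actual on-disk logic, just the shape of it:

import hashlib

def store(block):
    # keep the data together with a checksum of it
    return {"data": block, "checksum": hashlib.sha256(block).digest()}

def read_verified(copies):
    # Read redundant copies: return the first whose data still matches its
    # checksum, so a silently corrupted copy is skipped (and could then be
    # rewritten from the good one).
    for copy in copies:
        if hashlib.sha256(copy["data"]).digest() == copy["checksum"]:
            return copy["data"]
    raise IOError("all copies failed their checksum")

# example: mirror a block, silently corrupt one copy, read back the good one
good = store(b"important data")
bad = dict(good, data=b"important dat\x00")   # bit rot on one mirror half
assert read_verified([bad, good]) == b"important data"

The point compared to plain RAID1 is that the checksum says which half of the mirror to believe, rather than only that the two halves disagree.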


In contrast:

  • little classical RAID can do verifiable repair (fundamentally so: mirror copies or single parity can tell you something is inconsistent, but not which part is wrong)
  • on everyday reads, classical RAID often hands you data as-is without checking for stripe consistency (inconsistency would only be noticed at next scrub/rebuild). This is often a design choice, sometimes an optional feature.
  • All this comes at the cost of some speed (being software RAID)
(But it's a tweakable tradeoff for good guarantees, and ZFS is better tuned for the checks than most RAID can be without a bunch of redesign)

See also ZFS notes

On MTTDL

Unsorted

SCSI sense

http://blog.disksurvey.org/knowledge-base/scsi-sense/