Computer data storage - RAID notes

Computer data storage
These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


The various types of RAID, and what they do (and don't do)

Perhaps the first lesson of RAID is that it is built for speed rather than data protection.


Yes, even the redundant ones.

They cover some failure modes, and that's awesome,

...but not others, and that's a shitty footnote to find out about after you've lost most of your data


That said, none of the alternatives cover 100% of failure modes.

If you care about your data, your job is to be aware of failure modes, and to have the risks well quantified.

If you don't do that, RAID can create a false sense of security.



Official RAID levels

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Smin is the size of the smallest disk involved (In most cases you would use same-sized disks).

Usually the most interesting among the classical set are:

Each level below lists its basic description, read performance, write performance, space when using n disks, what it is robust to, and what it is useful for.

RAID0
  • Basic description: Striping. Data is chunked up and spread among all drives (no redundancy)
  • Read performance: more throughput, decent latency (verify)
  • Write performance: more throughput, decent latency (verify)
  • Space when using n disks: n×Smin
  • Robust to: nothing
  • Useful for: Highest speed, zero safety.
Any one drive failure takes a bite out of all data. Useful where processing speed is paramount, and data can be easily restored / generated again.
Latency increases slightly with more disks.

RAID1
  • Basic description: Mirror. Keeps an exact copy of data on 2 drives (or more)
  • Read performance: same/more throughput, good latency (verify)
  • Write performance: same throughput, decent latency (verify)
  • Space when using n disks: Smin
  • Robust to: (n-1) disk failures (data is fine as long as one disk still works)
  • Useful for: No extra space, little extra speed, quite safe.
Note 3-disk mirrors are safer than 2-disk (because details). Latency and read speed can be better than one disk's, but need not be - depends on implementation and loads.
Write latency and speed can be faster than a single disk when you're not saturating its bandwidth (since the controller can choose to postpone the write(s) to the other disk(s), though this is less safe without battery).
Space is expensive here: you pay for n disks and get the capacity of one.
You can't make larger arrays without nested levels.

RAID5
  • Basic description: Striping plus distributed parity
  • Read performance: more throughput, good latency (verify)
  • Write performance: more throughput, some latency cost (verify)
  • Space when using n disks: (n-1)×Smin
  • Robust to: 1 disk failure
  • Useful for: Increased space, speed, and safety.
Good for basic safety while not giving up much space.
Best for a handful of disks: with fewer, more of your space goes to parity; with more, safety reduces.
Any single drive failure means no data loss.
May deal better with concurrent access than RAID3, RAID4.
Write speed scales less than linearly because of parity overhead (as does read speed when you actually check that it's consistent -- which most don't do and leave up to scrub-time checks). For non-sequential workloads, parity RAID is often noticeably slower than e.g. RAID10, which for a handful of disks can be the better choice.

RAID6
  • Basic description: Striping plus double distributed parity
  • Read performance: more throughput, decent latency (verify)
  • Write performance: more throughput, some latency cost (verify)
  • Space when using n disks: (n-2)×Smin
  • Robust to: 2 disk failures
  • Useful for: Basically RAID5 that is robust to two drive failures instead of one, which makes more sense on larger arrays

Slightly slower than RAID5.
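
As a rough illustration of the space formulas for these levels, here is a minimal Python sketch. The function name, the dictionary, and the minimum disk counts are assumptions made for this example, not any tool's API; sizes are in whatever unit you feed it.

```python
# level -> (assumed minimum disk count, how many Smin-sized units hold data)
LEVELS = {
    "RAID0": (2, lambda n: n),       # n x Smin
    "RAID1": (2, lambda n: 1),       # Smin
    "RAID5": (3, lambda n: n - 1),   # (n-1) x Smin
    "RAID6": (4, lambda n: n - 2),   # (n-2) x Smin
}

def usable_capacity(level, disk_sizes):
    """Usable space of one array; every disk only counts for Smin."""
    min_disks, data_units = LEVELS[level]
    n = len(disk_sizes)
    if n < min_disks:
        raise ValueError(f"{level} needs at least {min_disks} disks, got {n}")
    s_min = min(disk_sizes)
    return data_units(n) * s_min

# Three 4TB disks and one 3TB disk: the odd one out drags Smin down to 3.
disks = [4, 4, 4, 3]
for level in LEVELS:
    print(level, usable_capacity(level, disks))
```

Mixing disk sizes wastes the difference on every larger disk, which is why you would normally use same-sized disks.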


Notes:

  • Yes, RAID2, 3, and 4 are part of the official list.
You can consider them variants of RAID5, most of them less pragmatic, at least in general-purpose RAID, though there are some specific modern uses
  • When you care about IOPS
    • accurate calculations are more complex than you think
    • ...but ballpark-wise, all of them scale roughly linearly with the number of disks, but less quickly in the safer ones. Basically, RAID0 > RAID10 > RAID5(0) > RAID6(0).
  • FAST can refer to RAID0
  • SAFE can refer to RAID1
  • BIG can refer to spanning
  • Under RAID5 and RAID6, it is typical that reads are served from data blocks, not parity blocks (yes, there are variants that optionally always read the parity to check, but this doesn't happen for most smallish-scale variants)


Unofficial RAID levels


Unofficial RAID types, variants, and some related concepts, include:

Each entry below gives the type and its basic description, followed by notes on speed, space, robustness, and use.

RAID1+0 - Striping of mirror-sets
Compared to RAID5/6, this has better random access and latency, i.e. better IOPS, which is useful for databases and such.
Safety can scale better than RAID5, but it is more expensive for the same space, and layout is a little more restricting.
There are some interesting variations of this. For example, see the details of linux md's near/far variants.

RAID0+1 - Mirrors of stripe-sets
Similar to RAID1+0, but less robust in most configurations - because usually the stripe sets will be larger, and they fail as a whole. (verify)

RAID5+0 and RAID6+0 - Stripe-set of parity-striped sets
Stripes between two (or more) RAID5/RAID6 sets.
Basically a RAID5/RAID6 variant better suited to arrays of 10-20 disks or more, because it lets you have more parity disks than a single RAID5 or RAID6 of the same number of disks.

RAID5+1, RAID6+1 - Mirror set of parity-striped sets
More robust than most other RAID variants, but more than half of all disks are tied up in safety or redundancy, so it is expensive per space.


JBOD - 'Just a Bunch Of Drives'
  • Speed: varies with use
  • Space when using n disks: all
  • Robust to: nothing
Using drives as-is. Not RAID.
On RAID controllers, this refers to the option to see and use drives separately instead of exposing the array as a device.
Sometimes refers to splitting files onto multiple drives to get RAID0-like speed/latency improvements when all these files are in common use.
Takes work to do well and offers no safety.

spanning, a.k.a. linear, SPAN, BIG - drives are appended to look like a single drive
Gives you all the space of all drives together, and doesn't require drives to be of the same size.
No robustness - fails as a whole (though in theory, contiguous files can be picked off the disk)
No striping to help throughput (may be faster if multiple requests' data happens to come from distinct drives, but there is no planning for that).

multipath - multiple cables/ports communicating with one drive
Primarily robustness to failure of things besides the disks themselves. Depending on the actual hardware used, can be made to deal transparently with failure of one cable, switch, or controller.
Some implementations will actively use the redundant IO paths (when it helps throughput).

SLED - 'Single Large Expensive Drive'
Single drives that store a lot of data, and usually are also physically larger than typical drives. Not very common, partly because RAID scales better.

MAID - 'Massive Array of Idle Disks'
Near-online (a.k.a. nearline): Having many storage devices that are off, possibly just powered off, or possibly on shelves somewhere. Can be accessed within a time that is reasonable for their use (e.g. hours).
Primarily useful for longer-term archival storage, as most storage lasts longer when not kept running continuously. Also sometimes done just to save power.

RAIN - 'Reliable/Redundant array of independent/inexpensive nodes'
Using a many-host cluster in a RAID-like way. Details, and particularly failure modes, vary. Often has a filesystem interface rather than a block interface.


RAID1E - Keeps two copies (not more?(verify)) on adjacent disks (E for enhanced)
The main difference with official RAID1 is that data is assigned to disks in a round-robin way, meaning you aren't tied to an even number of disks.

RAID5E - Variant that actively uses its hot spare for data (to get its extra IO bandwidth)
Keeps enough unused free space (at the end of each drive) so that a degraded array can be rebuilt into the drives that are left. In terms of safety it is similar to RAID5 with a present spare -- except that it will only be robust to a non-spare failure once it is rebuilt into the given drives (which may easily take a day or two)

RAID5EE - Like RAID5E, but with the space throughout the array instead of at the end of each disk.

RAID6E - Like RAID5E, but with two drives' worth of reserved space instead of one.




Things that don't make much sense:

  • RAID0+5, RAID0+6 - RAID5 or 6 at the top, each member of which is a stripe set. Each such stripe set fails as a set, so failure probability is somewhat higher than in even plain RAID5 or 6 without the striping. Also doesn't help performance much.
  • RAID1+5, RAID1+6 - RAID5 or 6 at the top, each member of which is a mirror set. Each such mirror makes it safer than plain RAID5 or 6. But doesn't do much for speed. RAID1 doesn't necessarily use both disks.

Optional RAID features

Hardware RAID, software RAID, fakeRAID



Hardware RAID refers to hardware which is functionally self-contained: dedicated processor, presents only the resulting logical storage.

Software RAID refers to doing everything from the OS (management, calculation) and on your CPU.

FakeRAID is basically hardware that turns out to do software RAID: It has sockets for disks, but no processing power.


People compare these on roughly two or three fronts:

  • convenience:
hard: your OS needs to know nothing about this RAID, all it sees is the resulting drive
soft: your OS needs to be told about the drives. The individual ones are also visible. You possibly can't boot off it.
fake: you generally need to install a driver/program to even see the logical drive, and that may not exist for your OS. You typically can't boot off it.
  • performance
hard: dedicated CPU, dedicated bus to disk, read caches, and write buffers tend to mean it's easier to guarantee more consistent and often higher speeds
soft: can be as good as hard, but more easily affected by other work that the OS is doing
fake: mostly the same as soft


A sort-of-third is that hardware can be a little safer, e.g. around OS crashes. This depends a lot on implementation and sometimes even goes the other way.


Also, a counterargument is that each hardware RAID controller is a single point of failure in a way that software RAID usually is not: the hardware RAID model may be a unique implementation, and if it fails you need to have a spare controller, or risk not being able to read perfectly fine data (without sending it off to a data recovery lab).

You can argue that this is actually a case where fakeraid is sometimes accidentally best. Sort of maybe.



External RAID

There are RAID boxes that are networked (NAS) - but also those that can be connected via eSATA (so DAS - Direct Attached Storage), USB3, or FireWire.

There are fancy SAS solutions that are easier and better for rack-like solutions.


Note that to a controller, eSATA drives don't really look very different from internal drives. Something similar often goes for SAS enclosures.


SATA RAID enclosures are nearly equivalent to putting the drives in your computer, but you don't have to worry about details like their power supply, having enough physical space for the drives, trays, and such.

They tend to consist of little more than a small case and a power supply, and either:

  • a RAID controller inside, exposed via eSATA (and possibly more, but it seems that USB3 is proving to be slower than eSATA)
    • is more portable - whatever has an eSATA plug sees it as a single large drive
    • apparently often limited to 16TB (hardware?, firmware?)
  • ...or a SATA port multiplier that lets you connect a few drives via one SATA cable. Often hooked up via (one or more) eSATA plugs
    • You often have to use a specific controller card (in part because simple SATA controllers may not understand port multipliers)
    • You often have to use their specific software
    • Can't really be taken along and plugged into another computer.
    • Size and speed limitation is up to the controller card you get.
    • Note that port multipliers typically connect at most 4 or 5 drives (apparently for speed reasons; many can serve more). This seems to be one reason many products have up to 4 or 5 bays.


RAID-Z


RAID-Z is the software RAID you can do with the ZFS filesystem.

It does analogues of RAID1, RAID0, RAID5, RAID6, (one variant with even more parity), and nesting. See also Computer data storage - ZFS notes#Some_words_on_layouts


It differs in implementation to typical hardware or software RAID, though.

Some differences that relate to your data:

  • it always hashes your data (which makes writes slightly slower)
  • reads always check data with those hashes.
When there is some source of redundancy (and the source of errors is the disk - UREs and UDEs), it can very typically self-heal during that read.
This is in contrast to a lot of current hardware RAID, which tends to just pass through data assuming that the non-parity blocks are correct. When you care not only about 'will I still have data when one drive drops out' but also about data integrity, you should care about this.
  • It uses a bit more metadata (because of ZFS, and because of variable stripe size)
  • It tries to avoid the unnecessary parts of the read-calculate-write cycle that RAID5 and 6 do.
This means it can be safer (no write hole) and write faster (certainly than software and fakeRAID RAID5/6), though not always (can be slower when the pool is almost full (verify)).
Also note that because ZFS is quite copy-on-write, it will generally be somewhat slower (but with enough disks, it may not be the bottleneck).

RAID uses, RAID limitations

RAID is for speed and/or uptime, not for data safety

RAID's design case was being resistant to a drive dropping out.


As a side effect of having many disks, you can also make larger and faster filesystems, and for scratch space this can even matter more than safety -- which is why the completely unsafe RAID0 is used a bunch.


But adding more disks naively will increase risks because you have more moving parts that can each fail, in ways you can model with basic probability.
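
The basic probability meant here can be sketched in a few lines of Python. This assumes drives fail independently and with the same probability over the period you care about, which real drives (same batch, same enclosure) do not quite do; the 3% figure is a made-up example, not a measured failure rate.

```python
def p_any_failure(p_single, n_drives):
    """Chance that at least one of n_drives fails, assuming independent failures.

    p_single is the (assumed) chance that one given drive fails in the period.
    """
    return 1 - (1 - p_single) ** n_drives

# Hypothetical 3% per-drive chance per year, for growing arrays.
for n in (1, 2, 4, 8, 16):
    print(n, round(p_any_failure(0.03, n), 3))
```

For a layout without redundancy (RAID0, spanning) this is directly the chance of losing the array; for redundant layouts it is the chance of ending up degraded at least once.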


So RAID is not backup, and it is not even a good way of implementing backup unless and until you can show the math behind why your specific setup is resistant to all the predictable failure modes.


Yes, RAID gives some resistance to UREs (that is, sectors signaled as "data disappeared, sorry"). Decently behaved RAID can behave better around UREs than a single disk with UREs, but don't count on it: a lot of implementations are fairly passive and trusting about this.

In other words, don't confuse fault tolerance with data integrity


Other limitations include

that the URE rate of your disks limits the amount of disks you should put in one array,
that it cannot provide data integrity (it does help, though),
UREs are not noticed in time without scrubs (and 'in time' is balanced with how often you do do scrubs)
does not deal with UDEs that led to silent corruption
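
To get a feel for why the URE rate limits array size, here is a naive Python model of the chance of hitting at least one URE while reading a whole array's worth of data, as a rebuild does. The 1e-14 per-bit rate is an assumed spec-sheet style figure and the independence of bit errors is a simplification, so read the results as pessimistic ballpark numbers rather than predictions.

```python
import math

def p_ure_during_read(read_tb, ure_per_bit=1e-14):
    """Chance of at least one URE while reading read_tb terabytes.

    Assumes independent bit errors at a constant rate (a simplification).
    """
    bits = read_tb * 1e12 * 8
    # 1 - (1 - r)^bits, via log1p/expm1 so the tiny rate is not rounded away
    return -math.expm1(bits * math.log1p(-ure_per_bit))

def raid5_rebuild_read_tb(n_disks, disk_tb):
    """A RAID5 rebuild has to read every surviving disk in full."""
    return (n_disks - 1) * disk_tb

# Hypothetical arrays of 4TB disks.
for n in (3, 4, 6, 8):
    read_tb = raid5_rebuild_read_tb(n, 4)
    print(n, "disks:", read_tb, "TB read,", round(p_ure_during_read(read_tb), 2))
```

In this model the chance grows with every disk you add, which is the argument for RAID6 (or smaller sets) on larger arrays, and for scrubbing so that such sectors are found before a rebuild needs them.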

A lot of RAID does not guarantee correct reads

the need for scrubs

Games of statistics

Corruption will probably happen, and will probably be silent

What about my filesystem?

Uptime doesn't mean uninterrupted good data

The fact that the disk array is happy is pretty irrelevant if bad data makes it through.


Once an error makes it through, what happens will vary.

Software that pays no attention may just deal with corrupt data, and do anything from never notice to crash.
If it appeared in fs/db journaling, it will probably be noticed and worked around somehow.
If it appeared in filesystem metadata, the OS may decide things are Very Bad and, for example, remount read-only, which tends to have an effect on many services.

The need for batteries/NVRAM


The write hole refers to the issue that if you replace data, failing (e.g. power failure, crash) in the middle of writing that data means it's now a mix of old and new data that is probably corrupt and unrecoverable.

In most cases it is possible to detect it is wrong but impossible to detect what the data should have been.

If you care about atomicity, never do this - the easiest fix is to write the new data separately and retire the old version. But this is more work, more bookkeeping, so not everyone does it.


In the context of RAID parity it leaves two devices inconsistent with each other, but is otherwise exactly the same problem.

In RAID, this corruption may well be silent, in that it will only be noticed in the next scrub (if enabled), or at rebuild (much later, and probably not the time you want to find out about it).

Even just reading the data may not notice - this depends on whether correctness is checked on every read. Systems focused more on speed than correctness may not do that.
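
The parity case can be illustrated with a toy model in Python. This is a sketch of single XOR parity over one stripe, with made-up block contents and helper names; it is not how any particular controller or md lays things out.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe_consistent(data_blocks, parity_block):
    """A scrub-style check: does the parity match the data?"""
    return xor_blocks(data_blocks) == parity_block

# One stripe on a toy 3+1 array: three data blocks plus their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# In-place update of one block: the new data reaches its disk...
data[1] = b"XXXX"
# ...and the system crashes here, before the parity block is rewritten.

print("consistent after crash:", stripe_consistent(data, parity))

# Nothing complains until a different disk dies and gets reconstructed
# from the surviving data plus the stale parity.
rebuilt_block0 = xor_blocks([data[1], data[2], parity])
print("block 0 rebuilt correctly:", rebuilt_block0 == b"AAAA")
```

Note that the block that gets damaged (block 0) is not the one that was being written, which is part of why this tends to go unnoticed until scrub or rebuild.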


There are two basic approaches to avoiding this:

  • never have power failures (as in, use UPS) or crashes (you're funny)
    • as long as the system actually stops writing before the UPS runs out (easiest to do via a forced shutdown).
  • ensuring that data will eventually make it to disk.
    • RAID vendors offer battery-backed memory or NVRAM, so that the operations in the controller's queue will finish when the system comes up again (NVRAM is often slower but won't be defeated by you forgetting to monitor the battery).
      • (this itself is only valuable when drives acknowledge a write when it's actually on platter -- some lie and acknowledge earlier)


The write hole is also a (write) performance limitation, because the system needs a read-calculate-write cycle for each write.

Plan for failure

Disks will fail. Whether it's after two or five or ten years, they will fail.

The longer you stay optimistic, the more you are doing a Russian data roulette thing.


Adding drives works both ways here.

More drives means you can afford to lose a single drive -- but each drive also adds an individual point of potential failure, so how that balances depends on details. RAID (1, 5, 6, and most hybrids) tempers the odds against random drive failure by allowing one or more drives to drop out without interrupting service.


It does not protect against user error, malicious intent, firmware bugs, software bugs, bad sectors and other pre-failure errors the drive makes, or anything else that changes/corrupts the data.


When we say backup, we usually mean archival backup: A whole copy or two of everything, from long enough ago that, say, a PhD student who deleted their thesis can make their way to the helpdesk and through the red tape in time to get a somewhat recent copy.

Other observations

There is a subtle difference between unpredictable failure and aging drives.

Since this is hard to separate in practice, we usually don't care. Broken is broken.


However, installing 40 drives of the same model and from the same production batch makes it likelier that many will fail within a year of each other (depending on quality. I've worked with bad enough drives that they failed without much pattern), which puts more pressure on your spares cabinet.


You need a spares cabinet, and you need it stocked well enough for the amount of arrays you have. That, or an extremely fast way of sourcing disks. If your array is running degraded with more disks likely to fail, then waiting on budgets, politics, miscommunications, red tape, weekends, or delivery is not a good idea.

(someone on slashdot pointed out that it can be a good idea to lock the cabinet, or label them "cold swap fallback device" or such, so that no one will steal them for other purposes and say they were just spares lying around)


Hot spares are nice in that the array will not unnecessarily spend time in degraded state.

Without hot spares, the rebuild cannot start until an admin comes by with a drive, slots it in, and (often) assigns it to the array, which can take days or sometimes weeks to happen.


Hot spares can also be overvalued

Even with hot spares, rebuilding may not always kick in when you think it does, in which case hot spares are essentially cold spares. So you still need to pay attention.

You can argue that hot spares are somewhat pointless when you use RAID6, RAID60, or some other variation that does not lose data until after more than one failure -- assuming you will intervene after the first within a sane time.

In many situations a hot spare will be kept spun down, but it can't hurt to check whether this works for you(r controller, drive, and other details of your setup).

If a hot spare is kept spinning, then it will wear slowly even though it's not being used.



For an old enough array, it can be easier and potentially smarter to build a new array and move all your data to that.

In certain cases you can do this without downtime. This often takes a little forethought, though not always. For example, consider how database replication can help.



Automatic notification is important. Being robust to four drive failures is pointless if you only notice once the fifth fails.

It's always useful to set up a cronjob that signals when RAID is in a bad state, be it via mail or some other way.

It can also be useful to have it send 'everything is okay' messages. It would suck to find out that some mail settings changed only by noticing that your array has failed.
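
What such a check might look like for Linux md software RAID, as a minimal Python sketch: it assumes the usual /proc/mdstat layout, where each array's status line ends in something like [2/2] [UU] and a missing member shows up as a lower second number and an underscore. The function names and the sample text are made up for this example, and hardware controllers need their vendor's tool instead.

```python
import re

# Made-up sample in the usual /proc/mdstat layout; md1 is missing a member.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sdc1[0] sdd1[1]
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Names of arrays whose status shows fewer active than expected members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

def status_line(mdstat_text):
    """One line to mail out either way, so silence itself means trouble."""
    bad = degraded_arrays(mdstat_text)
    if bad:
        return "RAID DEGRADED: " + ", ".join(bad)
    return "RAID OK"

print(status_line(SAMPLE))
```

In a real cronjob you would read the text with open("/proc/mdstat").read() instead of using the sample, and hand the resulting line to whatever mail or messaging setup you already trust.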



A degraded array will stay degraded for a while. Degraded means 'can fail completely'.

It is not unimportant that when you put in a new disk, the rebuild will take a while (at least a few hours, sometimes a few days. Assume no more than 0.2TB/hour), and the array will stay degraded until it's finished.

This is one reason RAID6 is more comfortable than RAID5.
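
As a back-of-the-envelope sketch of how long that degraded window is, here is the arithmetic in Python, using the 0.2TB/hour figure as the default. The function name is made up, and real rebuild speed depends on the controller, the disks, and how busy the array is in the meantime.

```python
def rebuild_hours(disk_tb, tb_per_hour=0.2):
    """Rough time to rebuild one replaced disk at a given rebuild rate."""
    if tb_per_hour <= 0:
        raise ValueError("rebuild rate must be positive")
    return disk_tb / tb_per_hour

# Hypothetical disk sizes in TB.
for size in (2, 4, 8, 16):
    hours = rebuild_hours(size)
    print(f"{size}TB disk: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

Add the time it takes before the rebuild even starts (noticing, sourcing a disk, slotting it in) to get the time the array really spends degraded.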


A rebuild is the likeliest time to find some bad sectors you didn't know about yet, particularly for larger arrays.

So scrub.



It can be useful to do some drive testing up front.

It looks like there is a decent correlation between errors early on, and general lifetime.

So if you have the time, you can opt to be a little paranoid: push them very hard for a while (maybe a week? Maybe more?), and only then add them to an array.

Be careful of replacement drives. If they're refurbished, they probably won't last long.

There are DOAs (Dead On Arrival) even with new drives.

See reports like "Failure Trends in a Large Disk Drive Population"

On throughput and latency, errors, drive choice, vibration, and more

Semi-sorted notes

On power

The OS on or off the RAID array

It's usually easier to boot from a drive separate from the array.

Yes, that separate drive is a single point of failure for boot, particularly if the RAID drivers/config are hard to get at or reproduce.

But so is the RAID array once it fails.


There are some conventional tricks related to this choice:

These days, SSDs make sense, since they can be very long lived, particularly if you can keep writes (e.g. logs) elsewhere.

The OS disk can be the cheapest, smallest drive you can find, and you can buy another to duplicate it onto, kept spun down until you need it.

With software RAID, you can partition all member drives the same way, and clone the OS partition to all of them. This may not be as easy as it sounds. But it may offer more flexibility than hardware RAID on the same disks can.

Partitioning or not

Semi-sorted

Spindown and power management

Why



Spindown will save a few Watts of power, per disk.


In laptops it will reduce the likeliness of drive damage due to shocks.


Even in desktops / servers it may help drive lifetime -- but the difference may be less significant than you may think.

...according to lore:

  • colder drives are happier drives - though note the google study suggests this only starts mattering above ~40C
  • Spindown on RAID spares is probably a good idea, as they are typically not used for a year
be sure the controller understands this, and won't time them out too soon.
ideally, check the disks periodically
  • fast spindown on active drives seems to be a bad idea, in that drives are likelier to fail when spinning up (suggesting this helps cause the problem of having too much friction to spin up).
If true, that argues for a medium-to-long delay before spindown. And, if you have multiple drives, putting all your "won't use often" stuff on one disk.
If both 'less wearing when idle' and 'less wearing by amount of spindown/ups' are grounded in truth, then for maximum lifetime you need to consider how frequently the drive is accessed.


Side note on the last: keep in mind background services like cron, updatedb updating its list at least once per day, and samba regularly updating its browse.dat (when debugging, strace -efile can be your friend).

noatime can matter a little

Placing unimportant things on a tmpfs mount can help.

Check spin state

smartctl -i -n standby /dev/name
-i because it's part of its general output,
-n standby because you probably want to prevent this check from spinning it up (see notes below)

Example (for all drives):

for d in /dev/sd?; do echo "$d"; sudo smartctl -i -n standby "$d" | grep -Ei '(power|model|serial)'; done


Example output:

/dev/sda
Device Model:     HGST HDN724030ALE640
Serial Number:    PK2234P8JZBB5Y
Power mode is:    ACTIVE or IDLE

Values seem to be:

  • ACTIVE or IDLE - spun up; normal operation (note that for most platter disks, IDLE is still spinning)
  • STANDBY - seems to be "spun down, will spin up if accessed" mode
  • SLEEP - seems to be "spun down and won't respond to anything other than a reset" (though this may vary by driver/kernel?)


These are also values to -n, which say "don't query drive if it is in this state":

  • -n sleep: skip if in SLEEP mode
  • -n standby: skip if in SLEEP or STANDBY modes
  • -n idle: skip if in SLEEP, STANDBY, or IDLE

This to prevent spinup caused only by smartctl. (To ensure this, you may also need to use -d, otherwise device autodetection may still have the side effect of spinning up the drive)
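
If you want to use this check from a script, a minimal Python sketch could look like the following. It only relies on the "Power mode is:" line shown in the example output above; the function names are made up for this example, and what smartctl prints when it skips a drive because of -n is not parsed here, so a missing line is simply reported as None.

```python
import subprocess

def parse_power_mode(smartctl_output):
    """Return the text after 'Power mode is:', or None if that line is absent."""
    for line in smartctl_output.splitlines():
        if line.startswith("Power mode is:"):
            return line.split(":", 1)[1].strip()
    return None

def query_power_mode(device):
    """Run smartctl without waking a sleeping/standby drive (needs root)."""
    result = subprocess.run(
        ["smartctl", "-i", "-n", "standby", device],
        capture_output=True,
        text=True,
    )
    # No 'Power mode is:' line most likely means the check was skipped.
    return parse_power_mode(result.stdout)

# The example output from above, fed through the parser.
example = """\
Device Model:     HGST HDN724030ALE640
Serial Number:    PK2234P8JZBB5Y
Power mode is:    ACTIVE or IDLE
"""
print(parse_power_mode(example))
```

query_power_mode("/dev/sda") would then give you the mode for one drive, or None when the drive was left alone.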


Alternatively, there is hdparm -C, but some people mention that it wakes up their drive, and that you can't really prevent it, which sort of negates the whole point.


See also