Computer data storage - RAID notes

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


The various types of RAID, and what they do (and don't do)

Official RAID levels

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Smin is the size of the smallest disk involved (in most cases you would use same-sized disks).

Usually the most interesting among the classical set are:

RAID0
  • Basic description: Striping. Data is chunked up and spread among all drives (no redundancy).
  • Read performance: more throughput, decent latency (verify)
  • Write performance: more throughput, decent latency (verify)
  • Space when using n disks: n×Smin
  • Robust to: nothing
  • Useful for: Highest speed, zero safety.
  • Notes: Any one drive failure takes a bite out of all data. Useful where processing speed is paramount and the data can easily be restored or generated again. Latency increases slightly with more disks.

RAID1
  • Basic description: Mirror. Keeps an exact copy of the data on 2 (or more) drives.
  • Read performance: same/more throughput, good latency (verify)
  • Write performance: same throughput, decent latency (verify)
  • Space when using n disks: Smin (expensive per unit of space)
  • Robust to: (n-1) disk failures (data is fine as long as one disk still works)
  • Useful for: No extra space, little extra speed, quite safe.
  • Notes: 3-disk mirrors are safer than 2-disk mirrors (a redundant copy is still left after one failure). Latency and read speed can be better than one disk's, but need not be; it depends on implementation and load. Write latency and speed can be better than a single disk's when you're not saturating its bandwidth (since the controller can choose to postpone the writes to the other disks, though this is less safe without a battery). You can't make larger arrays without nested levels.

RAID5
  • Basic description: Striping plus distributed parity
  • Read performance: more throughput, good latency (verify)
  • Write performance: more throughput, some latency cost (verify)
  • Space when using n disks: (n-1)×Smin
  • Robust to: 1 disk failure
  • Useful for: Increased space, speed, and safety. Good for basic safety while not giving up much space.
  • Notes: Best for a handful of disks: with fewer, a larger share of your space goes to parity; with more, safety goes down. Any one drive failure means no data loss. May deal better with concurrent access than RAID3 and RAID4. Write speed scales less than linearly because of parity overhead (as does read speed when you actually check that data and parity are consistent, which most implementations don't do and leave to scrub-time checks). For non-sequential workloads, parity RAID is often noticeably slower than e.g. RAID10, which for a handful of disks can be the better choice.

RAID6
  • Basic description: Striping plus double distributed parity
  • Read performance: more throughput, decent latency (verify)
  • Write performance: more throughput, some latency cost (verify)
  • Space when using n disks: (n-2)×Smin
  • Robust to: 2 disk failures
  • Useful for: Basically RAID5 that is robust to two drive failures instead of one, which makes more sense on larger arrays.
  • Notes: Slightly slower than RAID5.
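
The space figures above can be put into a small calculator. This is only a sketch: the function name, the level labels, and the minimum disk counts (2 for RAID0/RAID1, 3 for RAID5, 4 for RAID6) are assumptions chosen for illustration, and real controllers may reserve some additional space for metadata.

```python
def usable_capacity(level, disk_sizes):
    """Usable space of one array, in the same unit as disk_sizes.

    Uses the formulas from the notes above; Smin is the smallest member.
    The minimum disk counts are common conventions, not hard rules.
    """
    min_disks = {"RAID0": 2, "RAID1": 2, "RAID5": 3, "RAID6": 4}
    if level not in min_disks:
        raise ValueError(f"unknown level: {level}")
    n = len(disk_sizes)
    if n < min_disks[level]:
        raise ValueError(f"{level} needs at least {min_disks[level]} disks")
    smin = min(disk_sizes)  # every member is only used up to Smin
    if level == "RAID0":
        return n * smin
    if level == "RAID1":
        return smin
    if level == "RAID5":
        return (n - 1) * smin
    return (n - 2) * smin  # RAID6


# Three 4TB disks and one 3TB disk: the 3TB disk sets Smin.
for level in ("RAID0", "RAID1", "RAID5", "RAID6"):
    print(level, usable_capacity(level, [4, 4, 4, 3]), "TB")
```

Note how mixing in one smaller disk costs space on every member, which is why same-sized disks are the usual choice.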


Notes:

  • Yes, RAID2, 3, and 4 are part of the official list.
You can consider them variants of RAID5, most of them less pragmatic, at least in general-purpose RAID, though there are some specific modern uses
  • When you care about IOPS
    • accurate calculations are more complex than you think
    • ...but ballpark-wise, all of them scale roughly linearly with the number of disks, though less quickly in the safer ones. Basically, RAID0 > RAID10 > RAID5(0) > RAID6(0).
  • FAST can refer to RAID0
  • SAFE can refer to RAID1
  • BIG can refer to spanning
  • Under RAID5 and RAID6, it is typical that reads are served from data blocks, not parity blocks (yes, there are variants that optionally always read the parity to check, but this doesn't happen for most smallish-scale variants)
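
The parity idea behind RAID5 can be sketched with plain XOR: the parity block is the XOR of the data blocks in a stripe, so any single missing block is the XOR of everything that is left. This is a toy illustration with made-up block contents, not how any particular implementation lays out its stripes; RAID6's second parity uses different math and is not shown.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across three data disks (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Normal reads only need the data blocks. If the second disk drops out,
# its block is recomputed from the surviving data blocks plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

This also shows why a degraded read is more expensive: it has to touch every remaining disk in the stripe.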


Unofficial RAID levels


Unofficial RAID types, variants, and some related concepts, include:

RAID1+0
  • Basic description: Striping of mirror-sets
  • Notes: Compared to RAID5/6, this has better random access and latency, i.e. better IOPS, which is useful for databases and such. Safety can scale better than RAID5, but it is more expensive for the same space, and the layout is a little more restrictive. There are some interesting variations of this; for example, see the details of linux md's near/far variants.

RAID0+1
  • Basic description: Mirrors of stripe-sets
  • Notes: Similar to RAID1+0, but less robust in most configurations, because usually the stripe sets will be larger, and they fail as a whole. (verify)

RAID5+0 and RAID6+0
  • Basic description: Stripe-set of parity-striped sets
  • Notes: Stripes between two (or more) RAID5/RAID6 sets. Basically a RAID5/RAID6 variant better suited to arrays of 10-20 disks or more, because it gives you more parity than a single RAID5 or RAID6 over the same number of disks.

RAID5+1, RAID6+1
  • Basic description: Mirror set of parity-striped sets
  • Notes: More robust than most other RAID variants, but more than half of all disks are tied up in safety or redundancy, so it is expensive per unit of space.
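
The RAID1+0 versus RAID0+1 robustness difference can be checked by brute force. The sketch below counts which two-disk failures each layout survives on six disks. It assumes a simplified model in which a stripe set is lost as soon as any one of its members fails (the "fail as a whole" behaviour described above); the disk numbering and function names are made up for the example.

```python
from itertools import combinations


def survives_raid10(failed, mirrors):
    """RAID1+0 lives as long as no mirror set has lost all of its disks."""
    return all(not set(m) <= failed for m in mirrors)


def survives_raid01(failed, stripes):
    """RAID0+1 lives as long as at least one stripe set is fully intact."""
    return any(not (set(s) & failed) for s in stripes)


def count_survivable(disks, k, survives, groups):
    """Count the k-disk failure combinations the layout survives."""
    return sum(1 for combo in combinations(disks, k)
               if survives(set(combo), groups))


disks = range(6)
mirrors = [(0, 1), (2, 3), (4, 5)]   # RAID1+0: three mirrors, striped
stripes = [(0, 1, 2), (3, 4, 5)]     # RAID0+1: two stripe sets, mirrored

total = len(list(combinations(disks, 2)))
print("RAID1+0:", count_survivable(disks, 2, survives_raid10, mirrors), "of", total)
print("RAID0+1:", count_survivable(disks, 2, survives_raid01, stripes), "of", total)
```

Under this model RAID1+0 survives 12 of the 15 possible double failures and RAID0+1 only 6, and the gap widens as the stripe sets get larger.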
JBOD
  • Basic description: 'Just a Bunch Of Drives'
  • Speed: varies with use
  • Space when using n disks: all
  • Robust to: nothing
  • Notes: Using drives as-is. Not RAID. On RAID controllers, this refers to the option to see and use drives separately instead of exposing the array as a device. Sometimes refers to splitting files onto multiple drives to get RAID0-like speed/latency improvements when all these files are in common use. Takes work to do well and offers no safety.

spanning, a.k.a. linear, SPAN, BIG
  • Basic description: Drives are appended to look like a single drive
  • Notes: Gives you all the space of all drives together, and doesn't require drives to be of the same size. No robustness: it fails as a whole (though in theory, contiguous files can be picked off the disk). No striping to help throughput (it may be faster if the data for multiple requests happens to come from distinct drives, but there is no planning for that).

multipath
  • Basic description: Multiple cables/ports communicating with one drive
  • Notes: Primarily robustness to failure of things besides the disks themselves. Depending on the actual hardware used, it can be made to deal transparently with failure of one cable, switch, or controller. Some implementations will actively use the redundant IO paths (when it helps throughput).

SLED
  • Basic description: 'Single Large Expensive Drive'
  • Notes: Single drives that store a lot of data, and usually are also physically larger than typical drives. Not very common, partly because RAID scales better.

MAID
  • Basic description: 'Massive Array of Idle Disks'
  • Notes: Near-online (a.k.a. nearline): having many storage devices that are off, possibly just powered off, or possibly on shelves somewhere. They can be accessed within a time that is reasonable for their use (e.g. hours). Primarily useful for longer-term archival storage, as most storage lasts longer when not kept running continuously. Also sometimes done just to save power.

RAIN
  • Basic description: 'Reliable/Redundant array of independent/inexpensive nodes'
  • Notes: Using a many-host cluster in a RAID-like way. Details, and particularly failure modes, vary. Often has a filesystem interface rather than a block interface.


RAID1E
  • Basic description: Keeps two copies (not more? (verify)) on adjacent disks (E for enhanced)
  • Notes: The main difference from official RAID1 is that data is assigned to disks in a round-robin way, meaning you aren't tied to an even number of disks.

RAID5E
  • Basic description: Variant that actively uses its hot spare for data (to get its extra IO bandwidth)
  • Notes: Keeps enough unused free space (at the end of each drive) so that a degraded array can be rebuilt into the drives that are left. In terms of safety it is similar to RAID5 with a spare present, except that it will only be robust to another failure once it has been rebuilt into the remaining drives (which may easily take a day or two).

RAID5EE
  • Basic description: Like RAID5E, but with the spare space spread throughout the array instead of at the end of each disk.

RAID6E
  • Basic description: Like RAID5E, but with two drives' worth of reserved space instead of one.




Things that don't make much sense:

  • RAID0+5, RAID0+6 - RAID5 or 6 at the top, each member of which is a stripe set. Each such stripe set fails as a set, so the failure probability is somewhat higher than in plain RAID5 or 6 without the striping. It also doesn't help performance much.
  • RAID1+5, RAID1+6 - RAID5 or 6 at the top, each member of which is a mirror set. The mirroring makes it safer than plain RAID5 or 6, but doesn't do much for speed, since RAID1 doesn't necessarily read from both disks.

Optional RAID features

Hardware RAID, software RAID, fakeRAID



Hardware RAID refers to hardware which is functionally self-contained: dedicated processor, presents only the resulting logical storage.

Software RAID refers to doing everything from the OS (management, calculation) and on your CPU.

FakeRAID is basically hardware that turns out to do software RAID: it has ports for disks, but no processing power.


People compare these on roughly two or three fronts:

  • convenience
    • hard: your OS needs to know nothing about this RAID; all it sees is the resulting drive
    • soft: your OS needs to be told about the drives. The individual ones are also visible. You possibly can't boot off it.
    • fake: you generally need to install a driver/program to even see the logical drive, and that may not exist for your OS. You probably can't boot off it.
  • performance
    • hard: dedicated CPU, bus to disk, read caches, and write buffers tend to mean it's easier to guarantee more stable and often higher speeds
    • soft: can be as good as hard, but is more easily affected by other load
    • fake: mostly the same as soft


A sort-of-third is that hardware can be a little safer, e.g. around OS crashes. This depends a lot on implementation and sometimes even goes the other way.


Also, a counterargument is that each hardware RAID controller is a single point of failure in a way that software RAID usually is not: the hardware RAID model may be a unique implementation, and if it fails you need to have a spare controller or risk not being able to read perfectly fine data (without sending it off to a data recovery lab).

You can argue that this is actually a case where fakeraid is sometimes accidentally best. Sort of maybe.



External RAID

There are RAID boxes that are networked (NAS) - but also those that can be connected via eSATA (so DAS - Direct Attached Storage), USB3, or FireWire.

There are fancy SAS solutions that are easier and better for rack-like solutions.


Note that to a controller, eSATA drives don't really look very different from internal drives. Something similar often goes for SAS enclosures.


SATA RAID enclosures are nearly equivalent to putting the drives in your computer, but you don't have to worry about details like their power supply, having enough physical space for the drives, trays, and such.

They tend to consist of little more than a small case and a power supply, and either:

  • a RAID controller inside, exposed via eSATA (and possibly more, but it seems that USB3 is proving to be slower than eSATA)
    • is more portable - whatever has an eSATA plug sees it as a single large drive
    • apparently often limited to 16TB (hardware?, firmware?)
  • ...or a SATA port multiplier that lets you connect a few drives via one SATA cable. Often hooked up via (one or more) eSATA plugs
    • You often have to use a specific controller card (in part because simple SATA controllers may not understand port multipliers)
    • You often have to use their specific software
    • Can't really be taken along and plugged into another computer.
    • Size and speed limitation is up to the controller card you get.
    • Note that port multipliers typically connect at most 4 or 5 drives (apparently for speed reasons; many can serve more). This seems to be one reason many products have up to 4 or 5 bays.


RAID-Z


RAID-Z is the software RAID present in the ZFS filesystem. It does analogues of RAID1, RAID0, RAID5, RAID6, (one variant with even more parity), and nesting.

It differs in implementation, though. Some differences that relate to your data:

  • It always checks data from disks. When there is some source of redundancy, it can very typically self-heal (...when the source of errors is the disk: UREs (unrecoverable read errors) and UDEs (undetected disk errors)).
This is in contrast to a lot of current hardware RAID, which while reading assumes that the non-parity blocks are correct, and in some cases has to (RAID5 rebuild with one disk missing, RAID6 rebuild with two disks missing). When you care about data integrity, that sort of assumption is half the issue to start with, so you may well care.
  • It uses a bit more metadata (because of ZFS, and because of variable stripe size)
  • It tries to avoid the unnecessary parts of the read-calculate-write cycle that RAID5 and 6 do. This means it can be safer (no write hole, i.e. no parity inconsistency due to system crashes) and write faster (certainly than software and fakeRAID RAID5/6), though not always (it can be slower when the pool is almost full (verify))

RAID uses, RAID limitations

RAID is an uptime measure, not a safety measure

RAID's design case was resistance to drives dropping out.


As a side effect of having many disks, you can also make larger and faster filesystems, and for scratch space this can even be an interest more important than safety -- which is why the completely unsafe RAID0 is used a bunch.


But adding more disks naively will increase risks (basic probability, really), so RAID is not backup, and it is not even a good way of implementing backup until you can show the math behind why your specific setup is resistant to all the predictable failure modes.
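
The "basic probability" here can be made concrete. A minimal sketch, assuming drives fail independently and all share the same failure probability over the period you care about; the 3% figure is a placeholder, not a measured rate, and real failures tend to be correlated (same batch, same enclosure), which makes things worse.

```python
def p_any_failure(n_disks, p_single):
    """Chance that at least one of n independent disks fails,
    given each fails with probability p_single over the same period."""
    return 1 - (1 - p_single) ** n_disks


# Placeholder: 3% chance per disk per year.
for n in (1, 2, 4, 8, 16):
    print(n, "disks:", round(p_any_failure(n, 0.03) * 100, 1), "% chance of at least one failure")
```

For a stripe set without redundancy this is the chance of losing the array; for redundant levels it is the chance of ending up degraded, which is when the remaining risks start to matter.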


Yes, RAID gives some resistance to UREs (that is, sectors signaled as "data disappeared, sorry"). Decently behaved RAID can behave better around UREs than a single disk with UREs, but don't count on it: a lot of implementations are fairly passive and trusting about this.

In other words, don't confuse fault tolerance with data integrity


Other limitations include

  • the URE rate of your disks limits the number of disks you should put in one array
  • it cannot provide data integrity (it does help, though)
  • UREs are not noticed in time without scrubs (and 'in time' depends on how often you do scrubs)
  • it does not deal with UDEs that led to silent corruption
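
Why the URE rate limits array size can be sketched with a back-of-the-envelope estimate: a rebuild has to read all remaining disks in full, and the more bits you read, the likelier you are to hit at least one URE. The default of 1 error per 10^14 bits below is an assumption (a figure often seen on consumer drive datasheets), and the model treats errors as independent per bit, which real drives do not promise.

```python
import math


def p_ure_during_read(bytes_read, ure_per_bit=1e-14):
    """Chance of at least one unrecoverable read error while reading
    bytes_read bytes, assuming independent errors at ure_per_bit."""
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 10 ** 12
for tb in (1, 4, 10, 40):
    print(tb, "TB read:", round(p_ure_during_read(tb * TB) * 100, 1), "% chance of a URE")
```

Treat the output as a rough indication rather than a prediction; the point is mostly that the number stops being negligible at sizes that are now ordinary.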

Most RAID does not guarantee correct reads

the need for scrubs

Games of statistics

Corruption will probably happen, and will probably be silent

What about my filesystem?

Uptime doesn't mean uninterrupted good data

If software trips over data that was corrupted, then the fact that the disk array is happy is pretty irrelevant.


Once an error makes it through, what happens will vary.

  • Software that pays no attention may just deal with corrupt data, and do anything from never noticing to crashing.
  • If it appeared in fs/db journaling, it will probably be noticed and worked around somehow.
  • If it appeared in filesystem metadata, the OS may decide things are Very Bad and, for example, remount read-only, which tends to have an effect on many services.

The need for batteries/NVRAM


Plan for failure

Disks will fail.

Whether it's after two or five or ten years, they will fail. The longer you stay optimistic, the more you are doing a Russian data roulette thing.


Since adding drives actually means adding more individual points of potential failure, RAID (1, 5, 6, and most hybrids) tempers the odds against random drive failure by allowing one or more drives to drop out without interrupting service.


It does not protect against user error, malicious intent, firmware bugs, software bugs, or anything else that changes/corrupts the data, and it cannot help against bad sectors as much as you'd like.


When we say backup, we usually mean archival backup: a whole copy or two of everything, from long enough ago that, say, a PhD student who deleted their thesis can make their way to the helpdesk and through the red tape in time to get a somewhat recent copy.


Other observations

There is a subtle difference between unpredictable failure and aging drives.

Since this is hard to separate in practice, we usually don't care. Broken is broken.


However, installing 40 drives of the same model and from the same production batch makes it likelier that many will fail within a year of each other (depending on quality. I've worked with bad enough drives that they failed without much pattern), which puts more pressure on your spares cabinet.


You need a spares cabinet, and you need it stocked well enough for the number of arrays you have. That, or an extremely fast way of sourcing disks. If your array is running degraded with more disks likely to fail, then waiting on budgets, politics, miscommunications, red tape, weekends, or delivery is not a good idea.

(someone on slashdot pointed out that it can be a good idea to lock the cabinet, or label them "cold swap fallback device" or such, so that no one will steal them for other purposes and say they were just spares lying around)


Hot spares are nice in that the array will not unnecessarily spend time in degraded state.

Without hot spares, the rebuild cannot start until an admin comes by with a drive, slots it in, and (often) assigns it to the array, which can take days or sometimes weeks to happen.


Hot spares can also be overvalued

Even with hot spares, rebuilding may not always kick in when you think it does, in which case hot spares are essentially cold spares. So you still need to pay attention.

You can argue that hot spares are somewhat pointless when you use RAID6, RAID60, or some other variation that only loses its last redundancy after more than one failure -- assuming you will intervene after the first within a sane time.

In many situations a hot spare will be kept spun down, but it can't hurt to check whether this works for your controller, drive, and other details of your setup.

If a hot spare is kept spinning, then it will wear slowly even though it's not being used.



For an old enough array, it can be easier and potentially smarter to build a new array and move all your data to that.

In certain cases you can do this without downtime. This often takes a little forethought, though not always. For example, consider how database replication can help.



Automatic notification is important. Being robust to four drive failures is pointless if you only notice once the fifth fails.

It's always useful to set up a cronjob that signals when RAID is in a bad state, be it via mail or some other way.

It can also be useful to have it send 'everything is okay' messages. It would suck to find out that some mail settings changed only by noticing that your array has failed.
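
A minimal sketch of such a check, assuming Linux md software RAID, where /proc/mdstat shows a member map like [UU_] with an underscore for each missing member. Hardware controllers need their vendor's tool instead. The function name and the sample text are made up for illustration; cron typically mails whatever a job prints, so printing only on trouble is enough to get a notification.

```python
import os
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member map has a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None  # only the first status line belongs to the array
    return degraded


SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sdc1[0] sdd1[1]
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))

# On a real machine, run this from cron and let it print only when something is wrong.
if os.path.exists("/proc/mdstat"):
    with open("/proc/mdstat") as f:
        for name in degraded_arrays(f.read()):
            print(f"WARNING: array {name} is degraded")
```

For the 'everything is okay' variant, print a short status line unconditionally on a slower schedule (say, weekly).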



A degraded array will stay degraded for a while. Degraded means 'can fail completely'.

It matters that when you put in a new disk, the rebuild will take a while (at least a few hours, sometimes a few days; assume no more than 0.2TB/hour), and the array will stay degraded until it's finished.
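
To get a feel for how long that degraded window is, here is the arithmetic with the 0.2TB/hour ceiling from above as the default. The function name is made up, and an array that is serving other load during the rebuild will usually be slower than this.

```python
def rebuild_hours(disk_tb, tb_per_hour=0.2):
    """Rough lower bound on rebuild time for one replaced disk."""
    if tb_per_hour <= 0:
        raise ValueError("rebuild rate must be positive")
    return disk_tb / tb_per_hour


for size in (2, 4, 8, 16):
    hours = rebuild_hours(size)
    print(f"{size}TB disk: about {hours:.0f} hours ({hours / 24:.1f} days) degraded")
```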

This is one reason RAID6 is more comfortable than RAID5.


A rebuild is the likeliest time to find some bad sectors you didn't know about yet, particularly for larger arrays.

So scrub.



It can be useful to do some drive testing up front.

It looks like there is a decent correlation between errors early on, and general lifetime.

So if you have the time, you can opt to be a little paranoid: push them very hard for a while (maybe a week? Maybe more?), and only then add them to an array.

Be careful of replacement drives. If they're refurbished, they probably won't last long.

There are DOAs (Dead On Arrival) even with new drives.

See reports like "Failure Trends in a Large Disk Drive Population"

On throughput and latency, errors, drive choice, vibration, and more

Semi-sorted notes

On power

The OS on or off the RAID array

It's usually easier to boot from a drive separate from the array.

Yes, that separate drive is a single point of failure for boot, particularly if the RAID drivers/config are hard to get at or reproduce.

But so is the RAID array once it fails.


There are some conventional tricks related to this choice:

  • These days, SSDs make sense, since they can be very long lived, particularly if you can keep writes (e.g. logs) elsewhere.
  • The OS disk can be the cheapest, smallest drive you can find, and you can buy another to duplicate it onto, kept spun down until you need it.
  • With software RAID, you can partition all member drives the same way and clone the OS partition to all of them. This may not be as easy as it sounds, but it may offer more flexibility than hardware RAID on the same disks can.

Partitioning or not

Semi-sorted

Spindown and power management

Why



Spindown will save a few Watts of power, per disk.


In laptops it will reduce the likeliness of drive damage due to shocks.


Even in desktops / servers it may help drive lifetime -- but the difference may be less significant than you may think.

...according to lore:

  • colder drives are happier drives - though note the Google study suggests this only starts mattering above ~40C
  • Spindown on RAID spares is probably a good idea, as they typically go unused for a year
    • be sure the controller understands this, and won't time them out too soon
    • ideally, check the disks periodically
  • fast spindown on active drives seems to be a bad idea, in that drives are likelier to fail when spinning up (suggesting this helps cause the problem of having too much friction to spin up)
    • If true, that argues for a medium-to-long delay before spindown and, if you have multiple drives, for putting all your "won't use often" stuff on one disk.
    • If both 'less wear when idle' and 'less wear with fewer spindowns/spinups' are grounded in truth, then for maximum lifetime you need to consider how frequently the drive is accessed.


Side note on the last: keep in mind background services like cron, updatedb updating its list at least once per day, and samba regularly updating its browse.dat (when debugging, strace -efile can be your friend).

noatime can matter a little

Placing unimportant things on a tmpfs mount can help.

Check spin state


smartctl -i -n

-i because it's part of its general output,
-n because it prevents spinup due to this query (see notes below)
sudo smartctl -i -n standby /device/name

Example (for all drives):

for d in /dev/sd?; do echo "$d"; sudo smartctl -i -n standby "$d" | grep -Ei '(power|model|serial)'; done


Example output:

/dev/sda
Device Model:     HGST HDN724030ALE640
Serial Number:    PK2234P8JZBB5Y
Power mode is:    ACTIVE or IDLE

Values seem to be:

  • ACTIVE or IDLE - spun up; normal operation
  • STANDBY - seems to be "spun down, will spin up if accessed" mode
  • SLEEP - seems to be "spun down and won't respond to anything other than a reset" - though this may vary by driver/kernel


These are also values to -n, which say "don't query drive if it is in this state":

  • -n sleep: skip if in SLEEP mode
  • -n standby: skip if in SLEEP or STANDBY modes
  • -n idle: skip if in SLEEP, STANDBY, or IDLE

This is to prevent spinup caused only by smartctl. (To ensure this, you may also need to use -d, otherwise device autodetection may still have the side effect of spinning up the drive)
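
If you want to log spin state over time, the output shown earlier is easy to pick apart. A small sketch: the function name is made up, it only parses text you have already captured from smartctl, and the "Device is in ... mode" line it also looks for is what smartctl appears to print when -n makes it skip a drive, so treat that part as an assumption to verify against your version.

```python
import re


def parse_power_mode(smartctl_output):
    """Extract the power mode from captured `smartctl -i -n standby` output.

    Returns e.g. 'ACTIVE or IDLE' or 'STANDBY', or None if no mode was found.
    """
    for line in smartctl_output.splitlines():
        if line.startswith("Power mode is:"):
            return line.split(":", 1)[1].strip()
        # Assumed wording for the case where -n made smartctl skip the drive.
        skipped = re.match(r"Device is in (\w+) mode", line)
        if skipped:
            return skipped.group(1)
    return None


sample = """\
Device Model:     HGST HDN724030ALE640
Serial Number:    PK2234P8JZBB5Y
Power mode is:    ACTIVE or IDLE
"""
print(parse_power_mode(sample))
```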


Alternatively, there is hdparm -C, but some people mention that it wakes up their drive and that you can't really prevent it, which sort of negates the whole point.


See also