Computer data storage - General & RAID performance tweaking


Drive, driver, and OS-level tweaking

tl;dr

The defaults are decent for single platter drives. If you have one or more of:

  • an array combining to have more bandwidth than your usual drive (mostly meaning 'RAID that does striping')
  • primarily sequential needs
  • and/or programs doing bursts of many tiny accesses

...then tweaking can help throughput, and/or reduce the time spent blocking on IO.

Suggestions

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • When you have a lot of small requests, or a faster-throughput or lower-latency device (than typical drives), then it is quite possible that the lowish default for the number of IO requests allowed to be queued at a time is keeping performance down.
If most of your requests are large sequential reads/writes, this won't matter.
If many of your requests are small and/or randomly positioned, it can easily help to try a value like 1024, up to a few multiples of that.
In general, at some point the only effect of higher numbers is pointlessly high amounts of RAM used for buffers (that can't be used for anything more useful). (A combined sketch of these settings follows this list.)
  • dirty_background_bytes is often best set to a size the device can comfortably write within a few seconds.
Its default is appropriate for basic platter drives
...so if you have RAID or SSD then it is useful to increase this.
  • consider setting dirty_bytes two or three times as high as dirty_background_bytes, because you probably don't want this threshold to trigger during typical writing (this avoids block-all-process behaviour).
Sometimes you can also help avoid that behaviour by adjusting to how fast your largest writers flush their data (e.g. databases often let you do that).
  • When you commonly have fast sequential writes on fast drives (e.g. RAID), you may wish to lower dirty_expire_centisecs and possibly dirty_writeback_centisecs, so that you won't easily get buildup of dirty data to (in particular) the dirty thresholds. You probably still want the values high enough to write efficiently (depends on the IO patterns of your typical workloads, so can take some testing).


  • If you think it makes sense for your workload, you could increase read_ahead_kb
    • With RAID, investigate whether it does clever adaptive readahead. If so, don't set this value high (or you will largely negate that cleverness). If not, setting this readahead to the stripe width can make sense (particularly when you deal with access to the same files repeatedly, or just to large ones). (verify)
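
As a rough combined sketch of the above (the values and the device name sdb are only placeholders, not recommendations; tune to your own hardware and benchmark):

echo 1024 > /sys/block/sdb/queue/nr_requests     # allow more IO requests to be queued
echo 1024 > /sys/block/sdb/queue/read_ahead_kb   # ~1MB readahead (see the readahead notes below, particularly for RAID)
sysctl -w vm.dirty_background_bytes=268435456    # start background writeback around 256MB of dirty data
sysctl -w vm.dirty_bytes=805306368               # only block writers around 768MB of dirty data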

vm stuff, dirtiness

Background

Some IO behaviour relates to how linux manages memory at a lowish level. For example, the concept of dirty data: data in memory that still needs to go to disk.

This buffer exists in part so that related writes are likely to go out together, and get sorted (and likely merged) by the IO scheduler, both of which help the efficiency of the actual writing, particularly on platter disks.

This stuff mostly comes down to thinking about how fast data ends up in these buffers, how fast the backing device can write them, and whether you prefer snappiness over potentially faster throughput.


If you want to watch how much these buffers get used, try something like:

watch -n 0.1 'cat /proc/meminfo | egrep "(Dirty|Writeback)"'

I consider these two most useful:

  • Dirty
    • ...is the collective size of pages in RAM with new/changed data which has not yet been written to disk
    • May stay there for half a minute or so before it is picked up to be written. Once it is up for writing and there is a free position in the Writeback queue (according to nr_requests), that's where it will go
    • In a mostly idle system, Dirty is often a few hundred KB, sometimes a few MB while doing some background work. With serious writing going on this may be a few hundred MB or more.
    • Apparently, dirtiable memory (the memory that can be used for this) is Free + Cache - Mapped (So basically anything not used by programs)
  • Writeback
    • The amount of data we've queued to be written to disk nowish
    • When there is a lot of dirty data, or it has been dirty long enough, it's moved here.
    • On a mostly idle system, this will be 0 most of the time, because half the point is to keep things in Dirty until we write it out in a burst
    • When writing a lot of data, the figure should ideally stay below what the backing device(s) can write within a second or two. It can be much higher, but you may not want this, so this is one thing to tune.
    • Apparently it is only guaranteed that dirty data will be scheduled for writing (put in the writeback queue?(verify)) and not that it's on disk, though it's typically there within predictable time (assuming no hardware faults).

Also keep in mind that anything that calls sync() causes IO to block until these things are empty.(verify)


The most interesting tunable here is probably the maximum amount of dirty data. Because of the underlying mechanics, it (and typical request size) are the major influences on how things show up in Writeback.


There is one bit of behaviour you want to know about: Depending on the situation, the mentioned flush will block all programs.

If you've ever done a benchmark for this with dd if=/dev/zero, or something else sufficiently fast, you will have run into the behaviour where it flip-flops between:

  • accepting data into RAM blindingly fast
  • and blocking all writing programs while it plays catch-up on disk.
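
For example, something like the following (a sketch; it writes a scratch file, so pick a path and size that make sense on your system) will typically start very fast, then stall once the dirty thresholds are reached:

dd if=/dev/zero of=/tmp/ddtest bs=1M count=8192                  # often reports an inflated speed: much of it only went to RAM
dd if=/dev/zero of=/tmp/ddtest bs=1M count=8192 conv=fdatasync   # flushes before finishing, so the figure is closer to real device speed
rm /tmp/ddtest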


Applications writing faster than your device will, fundamentally, cause this. But you can help avoid it for shorter bursts, and you can cause it unnecessarily via bad tuning.


For large-chunked sequential write jobs, a large dirty-size limit has little added value. It can even be counterproductive, in that more moderate chunks would write just as fast, but with much less chance of blocking other things. (In some cases you can make a single offending program behave better; many database engines can flush their binary logs in configurable-sized chunks.)


On servers, the default 5% limit of dirtiable memory can be a lot. For example, 5% of 128GB of RAM is around 6GB, which may take half a minute to flush even on fancy RAID, so the above behaviour is something you may wish to tune to avoid.

If latency is important at all (small latency and/or predictable latency, including 'I want less choppiness in my interactive stuff'), you may want to lower the dirty limit, weighing (real-world) throughput against this sluggishness.


On desktops, you rarely see enough IO for such blocking, and a sequential write test is completely unrepresentative. Increasing the dirtiable buffer size may actually cause the flip-flop behaviour: it will do nothing most of the time, but occasionally churn hard for a noticeable time, during which the system feels choppy. In this case you may prefer to make the intervals and sizes smaller and shorter than even the defaults, so that the work is done in more frequent and smaller chunks, each of which is hardly noticeable. Overall your system will probably feel snappier to you, at the cost of throughput of each flush (throughput you weren't using anyway).

Tweaking

These are set in /proc/sys/vm, or via sysctl. (Note: some details are specific to recent kernels.)


  • One main tweakable is the threshold of dirty memory, which you can set to a percentage-of-dirtiable-RAM, or to an amount of bytes:
dirty_background_ratio
dirty_background_bytes
dirty_ratio 
dirty_bytes

When you set one of the _ratio values it becomes the setting in effect, and the corresponding _bytes value reads 0 - and the other way around.
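
For example (a sketch; the byte value is arbitrary):

cat /proc/sys/vm/dirty_background_ratio          # e.g. 10
sysctl -w vm.dirty_background_bytes=134217728    # set 128MB instead
cat /proc/sys/vm/dirty_background_ratio          # now reads 0 - the _bytes value is the one in effect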


The dirty_background_* value controls the point at which the kernel will decide to start writeback on its own, without bothering (blocking) any process. This is the subtler variant of writeback, because processes that aren't IO-bottlenecked are largely unaffected, as long as they aren't asking for a sync() or such.

The value should be large enough to make for efficient writing on your hardware and scheduler, but not so large that writeback waits longer than necessary. It's probably a good idea to keep this well under the figure in dirty_*, so that it's likelier to kick in before that.


When dirty_* is crossed, a single writing process will initiate a forced synchronous writeback (exactly what is written is still up to pdflush's heuristics)(verify). In other words, the point at which all programs that write are likely to be blocked. This will make your computer seem less snappy - possibly largely paused. The length of the pauses varies with how fast the data can be written out - which varies with how much dirty data you have, and how much fits into this queue because of this setting. The point seems to be to have an enforced way to avoid dirty memory size from going out of control. Ideally, it's a last-resort thing, though it's often not hard to load a disk more than it can deal with.


Defaults are something like 5 and 10%, or 10 and 20%, or 20 and 40%. The higher values are typically older, because since then the size of RAM has grown faster than speed of disks, meaning the time to flush it all has grown and grown. Keep in mind that units of 1% are somewhat large and crude when you have a lot of RAM (~40MB on 4GB, ~600MB on 64GB). You may wish to use *_bytes for this reason.



  • Another tweakable is the interval of checking the size of dirty memory.

This is relevant because when the buffers are large enough, or the writing slow enough, it is time that triggers the writeback (rather than the buffers filling up).

dirty_expire_centisecs
  • how long data may be dirty before it is marked to be flushed by the next pdflush run.
  • The idea seems to be to make sure related writes can and will often be written in a burst. It looks like it's not meant to deal with sequential writes.
  • Units of .01 second. For example 3000, which is half a minute.
dirty_writeback_centisecs
  • how often pdflush wakes up (to check for data marked for writeback)
  • For example 500, meaning 5 seconds.
  • ...which is probably reasonable for most disks.
  • If you have a reliable power supply (server on UPS, laptop) and no reason for crashes (your best guess), then you could increase this. You might get slightly better write patterns, and some disks may use a little less power.
  • Could be set to 0 to disable periodic writeback, though there's usually no point to doing so.
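
For example, if (as suggested earlier) you want dirty data on a fast array to be considered old sooner, something like the following (a sketch, not a recommendation):

sysctl -w vm.dirty_expire_centisecs=1500     # consider data old after 15 seconds
sysctl -w vm.dirty_writeback_centisecs=500   # keep the default 5-second wakeup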


Further notes:

  • dirty_bytes's minimum is two pages (however many bytes that is, usually 8KB?). Lower amounts are ignored.
  • Does it make sense to adjust dirty_bytes according to the backing device's write speed?(verify)
  • If you have one large writer, it can make sense to try to make sure it usually blocks itself, and not everything else.

readahead

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.
Idea

Readahead means that when you do a read operation, the data immediately following is automatically read too, in anticipation that it will be needed soon - a spatial locality assumption.

When the program accesses bits of data that is already in memory and avoids a drive seek, that both means it won't have to wait, and it doesn't interrupt other disk jobs.


It's most relevant for high-seek-latency drives.

Consider that the slowish speed of heads means that if you moved the head, you usually might as well read the next 1MB or so. Seek-limited jobs will suck only slightly worse than they did before, but relatively sequential jobs can be much faster (there are more details to that, see below); a mix of these two patterns will generally benefit.

Consider a real-world bad case for a database: a sequential scan (helped by readahead) plus enough random access to slow everything to a crawl. The random accesses are their own cause of slowness, but the sequential scan will still be helped.

It makes for large improvements for, say, programs that will read larger structures off of disk in small accesses.

Large readahead means random accesses will cause slowness because more time is spent reading unnecessary data for each such access. (note: applications can hint to the kernel that they are doing random accesses, which means the kernel will not do readahead for that operation).

Readahead has little effect on purely sequential reads. Readahead values over a few megabytes are usually pointless.


Implementation

Drives also do readahead themselves, and often have a handful of megabytes of on-drive cache for it. In the case of linux, it looks like readahead data is pulled immediately into the page cache; the block-level readahead setting seems to be the amount of RAM to spend on that. (verify)

The linux default seems to often be 128KB, sometimes 768KB(verify). 128KB is conservative given modern drives. The latter makes more sense.


Setting it to 1MB or so gives you some visible improvements on mixed workloads that frequently see fairly sequential access.

Values between 2MB..16MB may help specific workloads a tiny bit more, but these values are frequently counterproductive.


In linux there are two ways to do the same thing (setting one will also be reflected in the other)

/sys/block/md0/queue/read_ahead_kb (in 1KiB units)

blockdev --setra 4096 /dev/md0 (in 512B units, e.g. 256 = 128KiB, 1536 = 768KB, 4096 is 2MB)
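
To check the current value (same setting, two views of it):

blockdev --getra /dev/md0                  # reported in 512-byte sectors
cat /sys/block/md0/queue/read_ahead_kb     # reported in KiB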


Relation to seek time

Seek time is effectively part of your expectation of throughput. If a seek takes 7ms on average, then you can move the head at most 140 times per second. Assuming you have time to read 1MB at each point, that means you can't expect more than 140MB/s.

You won't see that (because reading takes time too). The point here is to give a meaningful order of magnitude for the readahead size.

More than 1 or 2MB would make little sense - you're probably already seeing all improvement to sequential, and will only see diminishing returns for random accesses.


On RAID you can multiply that figure, since at the block level a read pulls a bit from many disks. This depends on who is doing the readahead, and how.

In particular, if backed by RAID that does adaptive readahead, then there is no good reason for a high OS readahead, and it may negate the controller's cleverness.



For RAID

Real RAID controllers (fakeRAID often has no RAM) will often have their own read caching. Sometimes it's cleverer because it knows more, say, the number of drives. In other cases it does not interact ideally with OS readahead. For example, intentionally setting a low linux readahead has little effect if the RAID controller always fetches more.


If you use md, it seems that both the md device and the underlying devices have distinct values for readahead. I'm not sure which would be used.(verify)

The effect(iveness) of readahead probably interacts with your chunk/stripe size (probably more relevant for software RAID?(verify))

Command queueing

Command queueing means IO commands are kept in memory for a short while before they go to disk.

This is helpful on workloads that are not both purely sequential and done in large blocks, because it allows:

  • sorting of operations, which may mean the drive can write them more cleverly (with less seek overhead)
  • merging of adjacent operations (consider programs that write a large file 1KB at a time)
  • the OS to consider per-process queueing
  • in parity RAID, it can avoid multiple recalculations of the same parity block (verify)


There are two related but distinct concepts/implementations here: the scheduling done in the OS, and that done by the drives.


IO scheduling

The linux device scheduler is a queue that holds IO commands before they go out to a controller, and where possible (and configured) sorts and merges them.


This scheduling is configurable per block device. (with some schedulers effectively per-process-per-device)

(Note that this is separate from readahead and from writeback buffer(verify))


The effect a scheduler has depends primarily on device latency, and on application use patterns.


Latency: platter disks' seek latency is more significant. Tweaking any of this on SSD has less effect than it does on HDD, though there are still reasons to.


Use patterns of your typical workload will determine how much the parts will help - and what tradeoffs you can do. For example:

  • something doing a lot of work in very small chunks can effectively have less data queued (given that the queue holds a fixed number of operations). A longer queue is easily faster in such cases.
  • when queues contain more writes/reads in the same place, this tends to help
on platter primarily because it can reduce the amount of seeks a little
on SSD only if the writes are small, due to erases
  • when it also means merging, it typically reduces overhead more
though mostly for processes doing small sequential writes - if they do the same in larger chunks, or the access pattern is random, you wouldn't see much or any difference

For latency-sensitive applications you may want to lower the queue size, because each operation will go to the disk earlier, and any pauses caused by blocking while IO flushes will be shorter. Some people might want this on their desktops. It will come at the cost of lower throughput, and more seeking overhead under IO load from different programs, but this may be quite acceptable to you.



Changing

For example:


You can change the scheduler in use, which can be better for specific types of workload.

To inspect:

cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

To change on the fly (doesn't work on old versions of this kernel code, where you had to specify it in your boot line):

echo deadline > /sys/block/sdb/queue/scheduler


The default value for nr_requests (128) is fairly conservative for some workloads, more so when the block device is backed by RAID.

To set a larger queue size for a device, do something like:

echo 2048 > /sys/block/sdb/queue/nr_requests



Scheduler details

Note: there have been multiple revisions of some of the schedulers, so details can vary a little.

All schedulers have at least these tunables: (each also has its own, though most are less relevant than the first two here)

  • nr_requests - amount of operations allowed in the queue
  • read_ahead_kb - size of read-ahead window
  • max_sectors_kb - max request size allowed for this queue. On things like RAID it can make sense to make this higher. Cannot be larger than max_hw_sectors_kb (max transfer size allowed for this drive/driver. Tends to be a read-only value from the driver), and may be that high already

The schedulers:

  • cfq (the default in modern systems)
    • required if you want ionice (verify)
    • Sorts and merges.
    • Spends some CPU time trying to be fair, allocating timeslices for each process's queue, weighted by the process's ioniceness
    • if you want to tweak, it seems that the most interesting tunables are max_depth, slice_idle, queued, and quantum
    • (everything at the idle class seems to go into the same single queue, though)
    • tunables:
      • quantum - number of requests moved to dispatch at a time. Defaults to 4. For storage backed by many disks (RAID), this can be higher, though it can increase latency.
      • slice_async_rq - maximum asynchronous requests (usually writes) to queue. If you increase quantum, you may also want to increase this.
      • low_latency - try to keep latency low (at the cost of throughput). Defaults to 1?
      • slice_idle - the time to wait(verify) after real work before starting on the idle queue. You can set this to 0 when you expect seeking to have minimal influence (e.g. on SSDs and many-disk SAN).


  • deadline
    • Merges and sorts
    • ionice won't work (verify)
    • adds a deadline time to each request. When nothing times out, it does sorting. When there are requests that have timed out, it serves those first to attempt to avoid starvation (also round-robin?).
    • Basically, small requests get be-smart behaviour, while large requests and loaded times get guaranteed-to-be-written-soonish behaviour.
    • mixes in more queued reads than writes. This can be tweaked (see writes_starved)
    • If I'm estimating this right, it's good for balancing similar types of workload, but not so good at balancing, say, an interactive one and a bulk one. (verify)
    • if you want to tweak, the most interesting tunable seems to be:
      • fifo_batch (sort of like nr_requests on a smaller scale). Smaller can be better for latency, larger better for throughput.
      • writes_starved - how many reads can be sent before a write can be sent. Lets you balance workload type.
      • read_expire - how fast to expire a read, in milliseconds from enter time. Defaults to 500?
      • write_expire - how fast to expire a write, in milliseconds from enter time. Defaults to 5000 or so. Upping the expire values means it's less likely to step away from be-smart behaviour under load(verify)


  • anticipatory (used to be the default)
    • most like deadline, but with more (potential) cleverness (tries to estimate near-future IO)
    • ionice won't work (verify)
    • may cause highish latencies
    • quite tunable to specific workloads (verify)


  • noop
    • no sorting (or only very basic?), only merging (seemingly of already-adjacent operations?)(verify)
    • Little more than a FIFO, so introduces the least overhead of all schedulers.
    • queue size only matters in a 'size of memory buffer' way (verify)
    • For loads that are already sequential and in large chunks, this will probably have higher throughput than e.g. cfq. But when you have a lot of processes, a lack of balance/fairness is more likely
    • ionice won't work (verify)
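
The per-scheduler tunables mentioned above live under the block device's iosched/ directory; which ones exist depends on the active scheduler. A sketch, assuming sdb and the cfq scheduler:

cat /sys/block/sdb/queue/scheduler                   # see which scheduler is active
ls /sys/block/sdb/queue/iosched/                     # list that scheduler's tunables
echo 0 > /sys/block/sdb/queue/iosched/slice_idle     # e.g. cfq's slice_idle, as mentioned above
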
TCQ/NCQ

Drives can know their rotational position, where a read/write address sits on the platter, and where the head is.

TCQ/NCQ is a drive feature that plans the next few queued commands with that in mind.

Usually it has a mild positive effect.

When the commands are well-sorted, it may have little added effect. This seems to often be true when your OS scheduler has a large enough queue for your workload. It may even spend time effectively doing nothing, and have a mild negative effect.


The exact interaction isn't well reported, but if you expect the latter, try disabling it.

You probably wish to benchmark with it on and off -- with your particular real use pattern.


To disable, do:

echo 1 > /sys/block/sda/device/queue_depth
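
To restore it after benchmarking, you'll want to know the original value (31 or 32 is a common depth when NCQ is active):

cat /sys/block/sda/device/queue_depth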

See also

vm stuff:

queue stuff:

Keeping things in memory

Filesystem specifics

There's always relatime (or noatime) to consider: If you have no program that requires access times on files, you can avoid a lot of writes (that are primarily seeks).

Whether the difference is negligible or significant depends on use patterns.
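
For example, as a mount option in /etc/fstab (a sketch; the device, mount point, and filesystem are placeholders):

/dev/md0   /data   ext4   defaults,noatime   0  2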




You can inform some filesystems of the RAID layout, which lets them plan writes better (e.g. combine writes that go to the same stripe), which for some IO use patterns matters a lot.
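
For example, ext4 takes this as stride (chunk size in filesystem blocks) and stripe-width (stride times the number of data disks), and XFS takes su/sw. The sketch below assumes a 4-disk RAID5 (so 3 data disks), 64KiB chunks, and 4KiB filesystem blocks:

mkfs.ext4 -E stride=16,stripe-width=48 /dev/md0     # 64KiB/4KiB = 16; 16 * 3 data disks = 48
mkfs.xfs -d su=64k,sw=3 /dev/md0                    # stripe unit = chunk size, stripe width = number of data disks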

See e.g.




RAID speed tuning

md does its work in the CPU, so the host being very busy CPU-wise will degrade IO performance. So, say, don't run foldingathome while doing benchmarks...

Checking speed

array properties

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Chunk size

Ideal chunk size depends on use case. (it also depends somewhat on read and write cache(verify))


Ask yourself:

  • Do you primarily have large files (large data sets, video), or many smallish files (many fileservers)?
  • Do you mostly have/care about sequential access more, or random?
  • Will the array be accessed by a lot of users/services, or usually just one?
  • Do you figure you'll mostly read, or also write a lot?
  • What type of RAID do you use? (better choices here also depends on some of the above)

This can be hard to predict unless you have a specific-purpose computer, or specific-purpose array. Some things also interact with drive-level caching, drive readahead, and to a degree with array-level read and write caching, and even OS-level caching (although the last can do little more than make specific stupid IO patterns less stupid).


A small chunk size in striped/parity RAID means files are likelier to be spread over more disks.

This is good for throughput on a singular request on larger files.

This is bad for latency on concurrent access, because each access is likelier to put head-positioning demands on more physical disks, which (on platter drives, not SSD) will increase the latency of concurrent requests.

If you mostly read/write large files, then most accesses will go to all disks anyway, and the above is less relevant (though a very small chunk size will increase management overhead, that's not necessarily much).


In the case of parity RAID, consider that writes smaller than a stripe will need to read from the rest of the stripe to update the parity. Small random accesses on parity RAID are unavoidably non-ideal, though.
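
Chunk size is set when creating the array (changing it afterwards means a reshape). A sketch, with placeholder devices and an arbitrary 512KiB chunk:

mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=512 /dev/sd[bcde]1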


md settings

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The stripe cache is md's own chunk-and-disk-aware cache in the md driver. You could think of it as similar to the page cache.

It can help both read and write performance, depending on use patterns. Use of parity RAID also changes things.


If you vary its size, you'll see its effect in benchmarks, mostly on large sequential burst writes, apparently largely because a non-threaded bulk-writing program will spend less time waiting for the IO system to accept its data.

The stripe cache may be written to disk less often, which can mean that if it has randomly positioned contents, it may occupy the disk for longer at times of writing, making for more variation in latency.

At some point, larger numbers will have no effect on write speed. You could increase this until your (real-world) benchmarks show no more improvement. For small to moderate-sized arrays it is usually a value within the range 1024..8192.


If you want to see real-world use, you can watch the use of the stripe cache:

watch -n 0.2 cat /sys/block/md0/md/stripe_cache_active

Doing a huge dd or other sequential benchmark will not be a very good indication; it will just show the case that fills that cache most easily.


Inspect/set the stripe cache size at /sys/block/mdX/md/stripe_cache_size. The number is the amount of entries in the stripe cache.

Best choice depends on wishes.

  • default is 256, which is relatively conservative (1MB per disk)
  • while e.g. a setting of 8192 on a 6-disk set means (6*32MB=) 192MB
  • This cache's RAM use will be (stripe_cache_size * 4KB * number of disks) (verify)


To make this permanent, you could put something like the following in /etc/rc.local

echo 8192 > /sys/block/md0/md/stripe_cache_size

(People report this doesn't always stick, though. Why?(verify))

A cleaner solution would be a udev rule. Stolen from [1]:

SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/stripe_cache_size",
ATTR{md/stripe_cache_size}="8192"


To read

http://en.wikipedia.org/wiki/Mdadm

http://www.devil-linux.org/documentation/1.0.x/ch01s05.html

http://tldp.org/HOWTO/Software-RAID-0.4x-HOWTO-8.html