Computer data storage - ZFS notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

ZFS is a filesystem (with its own RAID and volume management) which focuses on data integrity: it checksums all data, and can transparently self-heal where the RAID layout allows it (and since it always checks, it deals not only with UREs (unrecoverable read errors) but also UDEs (undetected disk errors)).

(Note: Yes, some of the more expensive variants of hardware RAID also do checksumming, and/or always check with available duplicates/parity. A lot of generally decent brands and decent solutions do neither, though, which moves the cause of risk to other things, such as the quality of the disks. ZFS does not trust the disks to behave perfectly, which is a good feature at today's scale)

Some argue that shoving lvm, raid, error checking, and further features into one monolithic thing breaks with the *nix "do one thing well" approach. There's certainly a point there, but the stack's cooperation gives more features, simplicity of management, and flexibility than a combination like mdadm + lvm + a basic ZFS (i.e. ZFS without its own RAID) could give. In particular the concept of self-healing RAID makes some people very happy.


ZFS features include:

  • integrity checking
it will detect all bad data blocks coming off disk (as in, cases where the data read back isn't the same as what was originally written there. Note that this does not cover other possible sources of corruption - see some notes below)
admins will still want to schedule regular scrubs (zpool scrub)
You can disable checksumming, but then why use ZFS?
  • transactioned writes
more protection at the cost of some performance. You should only ever lose a few seconds of alterations.
  • self-healing, transparently triggered by the checksum checking (i.e. always before handing data to user)
...when there is enough redundancy (or rather, enough verifiably good data) to work from, i.e. in mirror and RAID-Z layouts
Because every fix is done on demand, ZFS has no fsck. A scrub is essentially a fsck in that it does all checking and healing it is able to (ZFS metadata is also checked, though I'm not sure how many structure checks are done, and what kinds of semi-broken metadata can be repaired)(verify).
  • multiple copies of metadata are kept (on top of possible underlying block duplication), so metadata is more likely to survive disk errors/failures than file data.
  • its own RAID
    • avoids the unnecessary parts of the read-calculate-write cycle that RAID5 and RAID6 do, because it is record-based: it writes new blocks and retires old ones, so there is no write hole.
    • you get self-healing (unlike most RAID5 and RAID6 implementations)
    • may be faster than other software RAID styles. However, it may be slower once the pool is almost full(verify).
  • filesystems - what ZFS calls filesystems are basically folders within a pool, each with its own set of enabled features (a short sketch follows this list), so you can e.g.
    • have a quota (in an 'allowed size of stuff in this filesystem' way; group and user quotas also exist)
    • compression of only your research document collection
    • 3-way duplication of only your PhD thesis data
    • deduplication of only your movies
    • ...etc.
    • see also the term dataset (a more abstract term encompassing filesystems, snapshots, clones, and more)
  • snapshots, copy-on-write style
  • clones, a similar idea to snapshots, potentially useful for VMs and such
  • zvol - basically a ZFS-backed block device
(with options of snapshotting and compression and such, like with filesystems)(verify)
  • replication - as in saving snapshots (and snapshot differences) to a file, and restore from them. And move them between pools and hosts.
the basic mechanism is that zfs send feeds data to zfs receive
A common tutorial example is to pipe this through SSH: if you have SSH set up anyway, it makes host-to-host replication a one-liner without a large intermediate file (see the sketch after this list)
  • deduplication (ZFS does this at block level)
Can be useful if you expect a lot of users to store the same content.
Slow when the dedup tables do not fit in RAM/ARC. It's a weird workload, so if you want to rely on it, plan for it: know the fundamental limitations and make sure they won't bite you.
  • some potentially clever read and write caching (with a bunch of details you may want to know)
    • The ZIL is interesting to read up on when you want performance via a hybrid SSD+HDD setup
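
As a concrete sketch of the per-filesystem properties and the send/receive mechanism mentioned above (pool, dataset, and host names are made up; the commands and property names are standard ZFS, though e.g. lz4 availability depends on your version):

zfs create tank/docs
zfs set compression=lz4 tank/docs     # compress only this dataset
zfs set quota=50G tank/docs           # cap how much it may hold
zfs create tank/thesis
zfs set copies=3 tank/thesis          # keep three copies of its blocks

# replication: snapshot, then stream the snapshot elsewhere (here via SSH)
zfs snapshot tank/docs@monday
zfs send tank/docs@monday | ssh otherhost zfs receive backup/docs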


On projects, OSes, and support:

  • Oracle's ZFS on Solaris is the original (and was initially the only) implementation
  • There are four OS-specific implementations, tied together by the OpenZFS project, which shares platform-independent changes between them:
    • FreeBSD (fork) (was nice and stable well before ZFS was usable on Linux)
    • ZFS on Linux (fork) (replacing the earlier, slower FUSE-based version)
    • OSX Server (fork)
    • illumos (fork) (illumos being a community fork of OpenSolaris that is fully, not mostly open)

Oracle has more or less broken with this project, presumably to get you to use Solaris. Arguably you're better off with OpenZFS. Time will tell.


Limitations

  • ZFS loves memory, enough that you may wish to buy some more.
It wants roughly 1GB of RAM just to run any likely pool comfortably.(verify)
For large pools, it's comfortable if it gets (very roughly) 1GB of RAM per TB of stored data.(verify)
(If you use dedup, you'll probably need at least an additional 3GB of RAM per TB of data)
  • you can't remove vdevs.
...meaning certain admin mistakes are bad, so think before you sudo
  • You can't shrink vdevs or pools, only expand pools.
This means it is better to plan ahead for the way you will expand.

ZFS is copy-on-write by nature. That makes various things safer, some things faster, and some things slower.

  • it is slower for some workloads
You probably don't want it to back iSCSI -- at least, not without a lot of reading and tweaking.
VM images are not as bad, but still take some attention.
It fragments more easily,
particularly on nearly-full pools. Avoid that when you can.
Some workloads, e.g. random small alterations, will fragment even when the pool is not full
If you primarily work with databases, search indexes, and other things that pretty much implement their own filesystem in the first place, then ZFS is not necessarily the thing you can squeeze the most performance out of.
However, there are some tricks you can apply to get close while still getting ZFS's data integrity. If that combination sounds good, it's worth checking out (read up on ARC and ZIL if you do, and read that evil speed tweaking guide).


  • If you are considering ZFS for a very important server, consider:
    • checks+heals is only about the disks. That's more than most filesystems do, but...
    • If you want end-to-end guarantees, ZFS is only part of the puzzle. For example, a RAID scrub with bad RAM in place can destroy data on any system, so ECC RAM is recommended.
    • In critical high-uptime servers you are always looking to avoid single points of failure. ZFS by itself is such a single point: it doesn't cluster on its own, so if it fails it fails as a whole.
      • You may like it to be the filesystem underlying a distributed filesystem, though.
    • healing ZFS is still not backup
Its MTTDL is higher than that of a lot of comparable hardware/software RAID, but the requirements of a backup system are fundamentally distinct from those of a RAID system, so only a well-designed backup system is backup.
The real MTTDL varies significantly with how you use ZFS, and with the design/quality of the hardware you use it on.
  • ZFS cannot guarantee anything about the OS or hardware it's running on. Corruption can, as with any filesystem, still happen in cases of:
    • Bit flips in RAM (ECC is recommended), though some introduced errors are still recoverable
    • Badly seated RAM (tends to make for large errors)
    • bad IOs (malicious or mistaken) not isolated via IOMMU (e.g. because the IOMMU is disabled)
    • kernel bugs
    • ZFS bugs
    • disk firmware bugs
    • disk systems that ignore flushes (e.g. acknowledge the command but postpone the actual write), and/or reorder IO operations around flushes. One example is RAID with a write cache that is less protected than you thought; another is hard drives that gain apparent speed by acknowledging writes before they are actually on the platter.
    • See things like https://clusterhq.com/blog/file-systems-data-loss-zfs/
    • not enough redundancy for a given issue. Consider that a misdirected write can work out as two mistakes in one.


ARC

L1ARC is RAM used to cache recently read files. (ARC itself is basically a slightly cleverer version of an LRU cache)

Its use and adaptive size make it quite similar to the OS-level page cache. It can be somewhat cleverer than that cache, though it is also not as easily vacated (this varies with the OS-specific implementation; apparently most are better behaved than Linux's(verify)). So when it is important that all RAM allocations can always be served, and/or RAM is tight, you may want to set a limit on the ARC's size - but if you do, be aware that too small a limit (what that is depends on the use case) means ZFS performance suffers.

Note that when your major applications manage their own purpose-aware caching (e.g. postgresql), it is preferable to tell ZFS to use ARC only for metadata, not file data.
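
A minimal sketch of that (dataset name is made up; primarycache is the relevant property, and can be all, metadata, or none):

zfs set primarycache=metadata tank/pgdata    # keep only this dataset's metadata in the ARC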



L2ARC is an optional device-based read cache.

You can consider using an L2ARC when you have a subset of your data that you read frequently, too large to keep in RAM, but which e.g. fits on an SSD.

SSDs are the usual choice here, because the point is read performance on that frequently-used data, particularly for small random reads. An example of where it would help is a photo site's image servers: these get read-heavy workloads of many smallish reads, and there is typically a specific subset (recently added and popular images) being requested a lot while the bulk stays dormant.

There is no rule against using HDDs, it will just often be no faster than not having an L2ARC: the best performance you can get is that of a single disk (your pool probably has striping in it, but an L2ARC cannot, as of this writing), so you won't get lower seek times out of that one disk. The best you can say is that you're dividing the work a little, which may help keep overall seeks down a little.

Basically, do this only when you can easily explain that it will help.
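
If you do decide to, adding a cache device is a one-liner (pool name and device id are made up):

zpool add tank cache /dev/disk/by-id/ata-SOMESSD_SERIAL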


On errors and failure: regular checksumming also applies to L2ARC contents, so no errors make it through. If the backing device fails, ZFS transparently stops using the L2ARC.



ZIL

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The ZIL (ZFS Intent Log) is basically a transaction log (similar to those in databases, if you're familiar)


It is also used as a write cache, but only for sync writes (to ensure the commit can be replayed). Non-sync writes are only in RAM (typically a few-second cache(verify)), so at crash/power-failure time you will lose more of the async backlog than of the sync backlog. This bias makes sense: you will always lose something, but sync writes are often used specifically for correctness mechanisms, such as database journaling.

Most local workstation workloads involve few sync writes. iSCSI may also not issue sync writes, which can be a bad idea.

NFS by default uses sync writes - may be good to know. You could alter some of your programs to mark their requests as synchronous to, effectively, get more protection via the ZIL.

Note that you can force ZFS to treat all writes as async(verify).
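
In current implementations this is the per-dataset sync property (dataset names are made up; sync=disabled trades safety for speed, sync=always does the opposite):

zfs set sync=disabled tank/scratch   # treat all writes as async - unsafe for data you care about
zfs set sync=always tank/db          # treat all writes as sync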



On place and size

By default, the ZIL is stored within the pool it serves.

This is fine for workstations because they do few sync writes in the first place, but if you structurally have a decent amount of sync writes (database server, fileserver), you probably want to put the ZIL on a separate device.

Putting the ZIL on a separate device (doing so is termed SLOG, "Separate intent Log") speeds up sync writes, because you've basically created SSD+HDD hybrid storage (for sync writes)

  • these writes can go to the pool in their own time (while being safe, in the transactioned sense)
  • You don't have the added seek time from having both the ZIL write and the real write to the same disk set
  • If you put it on an SSD, the write to the ZIL has negligible seek time, which results in higher IOPS for multiple clients (often something in between the SSD's and the backing storage's), against exactly the same backing storage


You can compare it to a RAID controller using its write buffer, backed by SSD - with some differences: ZFS will still be a little nicer to your data around failures, since it's doing transaction-like stuff.


Depending on the workload, the ZIL need not be large, because it's functionally mostly just a write buffer, which rarely needs to be large. Allocating 2GB of an SSD is often more than enough (and leaves the SSD room to do its own wear leveling).

For many-client servers, specifically using an SLC SSD can make sense (verify)


On failure of device(s) backing the ZIL:

  • Earlier versions could not deal well with ZIL failure (≤v19?)
  • The ZIL content is duplicated in ARC (read: in RAM) until it is in the pool, so the ZIL failing means no data loss in itself
    • ...except if power fails in the few seconds after the ZIL device fails, because sync writes that would normally also sit safely in the ZIL until flushed to the pool are then only in RAM. If that small time window matters, make the ZIL a mirror (see the sketch below).
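
For reference, a minimal sketch of adding a SLOG (pool name and device ids are made up; log is the standard zpool vdev type for this):

zpool add tank log /dev/disk/by-id/ata-SOMESSD_SERIAL
# or mirrored, if that failure window matters:
zpool add tank log mirror /dev/disk/by-id/ata-SSD1_SERIAL /dev/disk/by-id/ata-SSD2_SERIAL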


See also:


On duplication to fight error

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


ZFS terminology

In terms of major organization:

  • Physical disks go in vdevs
  • vdevs go in pools
  • pools store your files and have overall failure/speed characteristics according to their vdev layout


On expansion, on balancing

Practicalities on pool creation

How to refer to devices

Any way you want, but keep in mind that /dev/sd? devices can change order and cause problems. It is recommended to use fixed references.


Basically: Look at the contents of /dev/disk/by-id/ and look at the drive label.

If the WWN is printed on the drive label, that's nice and shorter. If not, the ata-model_serial entry works just as well.

...though frankly, you may wish to buy a label printer anyway :)
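
A minimal sketch of finding those names (the grep just hides the per-partition entries; the names you see depend on your drives):

ls -l /dev/disk/by-id/ | grep -v part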

Old or new sector size

  • -o ashift=9 is for old-style 512-byte sector drives
  • -o ashift=12 is for new-style 4KB-sector drives, and typically what you want
There are a few cases where bootloaders may behave better with 512
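
So pool creation typically looks something like the following (pool name and device ids are made up):

zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-MODELA_SERIAL1 /dev/disk/by-id/ata-MODELA_SERIAL2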


Effects of sector size, stripe size, vdev size

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/


does not contain an EFI label but it may contain partition information in the MBR

Basically means zpool is being careful about a drive that may contain other data(verify), so add -f ('force') like it suggests

Creating (various layouts)

ZFS always stripes across a pool's top-level vdevs.

A vdev can be:

  • a lone disk (result is like RAID0)
  • n-way mirror (result is like RAID1 if one, RAID10 if multiple)
  • raid-z1 (like RAID5 if one, like RAID50 if multiple)
  • raid-z2 (like RAID6 if one, like RAID60 if multiple)
  • raid-z3


It usually makes most sense to make all top-level vdevs the same. That is, given identical disks, you get more predictable/quantifiable fault tolerance, redundancy, and performance.

ZFS does not prohibit you from mixing different vdev types and sizes within a pool. That can make sense (e.g. when adding different-sized drives), but it also allows you to do stupid things (e.g. a top-level stripe across RAID-Z2, RAID-Z2, and a single disk). Since you can't remove a vdev, such a mistake is more or less permanent, so in general, check that the command you give makes sense.




To create a RAID0-like stripe, just add a bunch of one-drive vdevs directly to the top level of the pool:

zpool create name members

Striped vdevs

  • detects errors (through basic ZFS checksumming)
  • cannot heal errors


If you want to mirror across two drives (basically RAID1) or more (very good error protection), create a single mirror vdev in the pool with all drives as members:

zpool create name mirror members

Mirrored vdevs (similar to RAID1)

  • detects errors (through basic ZFS checksumming)
  • heals errors when there is at least one copy that checks out (which there usually is)


If you want to stripe across mirrors (like RAID10), add multiple mirror vdevs to the zpool, e.g.

zpool create name mirror member1 member2
zpool add name    mirror member3 member4

Striped mirrored vdevs (similar to RAID10)

  • detects errors (through basic ZFS checksumming)
  • heals errors (assuming there is at least one copy that checks out)
  • performs better than raid-z (much like RAID10 performs better than RAID5/RAID6)


If you understand the above, you'll probably find the rest pretty analogous.

Say, for a single raidz1 vdev:

zpool create name raidz member1 member2 member3 member4 member5 member6 member7

raidz, a.k.a. raidz1

  • detects errors (through basic ZFS checksumming)
  • adds parity on top of checksums (uses more disk space)
  • Similar to RAID-5 in the amount of missing blocks it can stand, except that
    • ZFS guarantees a read is correct (checks and heals on the fly)
    • ZFS can deal with more than UREs (because it checks data)
    • ZFS can tell which vdev is at fault (verify) (necessary for the above)

raidz2

  • like above, can deal with two damaged blocks (similar to RAID-6)
  • (Everyday?(verify)) performance between Z1, Z2, and Z3 is apparently quite similar.

raidz3

  • like above, can deal with three damaged blocks. Yet more peace of mind. No analogous classical-RAID level.
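
The commands are analogous (pool and member names are placeholders, as above):

zpool create name raidz2 member1 member2 member3 member4 member5 member6 member7 member8
zpool create name raidz3 member1 member2 member3 member4 member5 member6 member7 member8 member9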


See also:

After pool creation

By default, ZFS mounts the pool at creation time, at /poolname

If you want to move it, you'll want to know about

zfs set mountpoint=/mnt/here poolname

There are more details, of course. See things like:


If you don't care about atime, then you may want to do:

zfs set atime=off poolname
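
You can check what a property is currently set to with zfs get, e.g.:

zfs get atime,compression,mountpoint poolname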


Some inspection

Pool summary:

zpool list

Health summary (will also mention things like an ongoing scrub):

sudo zpool status [poolname]

Doing a scrub (you should probably cronjob these):

zpool scrub poolname
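
For example, a crontab entry for a weekly scrub might look like the following (the zpool path and schedule are just an example; adjust for your system):

0 3 * * 0  /sbin/zpool scrub poolname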

See performance of pool, add -v to get performance of individual drives too (2 is the interval in seconds):

zpool iostat -v 2 [poolname]


Sometimes it's useful to remember the admin commands you used earlier for this pool, e.g. when trying to remember what ashift parameter you used.

zpool history [poolname]


Expectable performance

On compression

Benchmarking

Issues and errors

On dedup

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Unsorted

On ZFS pool versions