Computer data storage - ZFS notes


📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.


When ZFS may be interesting for you

Roughly:

  • ZFS chooses safety over speed
when you care about your data and scale to more than one or two disks, this is probably the only reasonable approach
a good idea even with pricy disks but especially with cheaper disks.
(it relates to the whole "Bit Error Rate has stayed constant while arrays are getting much bigger" thing)
speed is often actually still quite decent - purely because you'll probably be adding a whole bunch of platters/channels
  • ...relatedly, it chooses to distrust disks
it verifies checksums on all reads
if it sees a mismatch it can fix (as in, enough verifiably good data from duplication or parity), it fixes it before returning the data
if it sees a mismatch it cannot fix, it tells you by failing the read
...note that none of the above three is necessarily guaranteed on any given classical RAID
note that this does not deal with all possible sources of corruption, but more than various other solutions
As fixes are done on demand, ZFS has no fsck. A zpool scrub is close enough, which is why you should schedule these
multiple copies of metadata are kept
so filesystem integrity is (more) likely to survive sector errors/failures (than file data, unless you choose to duplicate that too)
transactioned writes - journaling, a little more so than some implementations.
More protection at the cost of some performance
  • can do stripe/mirror/parity stuff
being file-based, it avoids some unnecessary parts of the read-calculate-write cycle that RAID5 and RAID6 do.
no write hole issues since it's copy on write (writes new block, then retires old one)
causes slowdown when almost full
yet generally a little faster than simpler software RAID
  • can do transparent compression


  • can do quota
  • can do LVM-like things
  • can do snapshots
however, the snapshots have practical limitations


  • replication - as in saving snapshots (and snapshot differences) to a file, and restoring from them. And moving them between pools and hosts.
basic mechanism is that zfs send feeds data to zfs receive
A common tutorial example is to do this via SSH, to avoid an intermediate file - if you had SSH set up anyway, it makes host-to-host replication a one-liner without a large temporary file (sketched below)
not so ideal at very large scales, though
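A minimal sketch of the idea (pool, filesystem, snapshot, and host names here are hypothetical):

zfs snapshot tank/data@monday
zfs send tank/data@monday | ssh otherhost zfs receive backuppool/data
# later, an incremental send of only what changed between two snapshots:
zfs send -i tank/data@monday tank/data@tuesday | ssh otherhost zfs receive backuppool/data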


  • deduplication at block level.
however, you probably don't want to use it (some discussion somewhere below)


The heal-or-fail behaviour is probably the deal-maker for many, and the transparent compression a nice bonus.



Limitations / food for thought:

  • ZFS is copy-on-write by nature.
It makes various things safer
It makes various things a little slower
It even makes a few things faster
  • ZFS can be fast, but this is generally a secondary concern to safety, whenever it is a choice
e.g. read-heavy workloads are easier than write-heavy ones
  • ZFS eats memory. Enough that you may wish to buy some more.
...in part because when you use ZFS, you're likely to be setting up a larger-than-your-average workstation storage.
How much? TODO:(verify)
For people wishing to run it on their old hardware: With tweaking it can be made to run, but know what you're doing. Things likely will stutter more.
For decent responsiveness on a pool, it'll quickly use 1GB, and be more responsive with 4GB, preferably 6GB (partly people-repeating-each-other figures, but e.g. below 4GB of RAM it disables vdev prefetch)
For large pools, add some RAM scaling with the pool size - 1GB of RAM per 1TB of actively used data is nice
(If you use dedup, you'll probably need at least an additional 3GB of RAM per 1TB of data)
  • you can't shrink storage, and you can't remove vdevs from a pool
...the latter means certain admin mistakes are bad, so think before you sudo
  • It's software RAID in the end. For CPU-constrained hosts, this will mean speeds below raw disk speeds.
For servers this can be negligible, but it can be an issue.
  • the snapshot thing seems designed for archival backup, not restore speed. Fetching back files can be involved and slow, particularly for many-TB storage


  • It can be argued that combining LVM, RAID, error checking, quota, compression, etc. into one monolithic thing breaks with the *nix "each program does one thing well" philosophy.
others argue "not so much, when these are all aspects of the same thing (storage), and the parts often integrate with each other better than they would as independent pieces".
This probably should stay a lively discussion, because both have good points.



  • some potentially clever read and write caching (with a bunch of details you may want to know)
The ZIL is interesting to read up on when you want performance via a hybrid SSD+HDD setup


  • You probably don't want it to back iSCSI, for speed reasons. At least, not without a lot of reading and tweaking.
VMs are not as bad, but still interesting.
  • it fragments more easily
Particularly on full pools. Avoid filling above 80% when you care about speed (true for most filesystems)
Some workloads, e.g. random small alterations, will fragment even if not full
  • If you primarily work with databases, search indexes, and other things that pretty much implement their own filesystem in the first place, then ZFS is not necessarily the thing you can squeeze the most performance out of.
However, there are some tweaks and tricks you can apply - you can get good performance and still get ZFS's data integrity. If that combination sounds good, it's worth checking out (read up on ARC and ZIL if you do, and read that evil speed tweaking guide).


  • If you are considering ZFS for a very important server, consider:
    • Its MTTDL may be higher than that of a lot of similar hardware/software RAID, and more controllable, but the requirements of a backup system are fundamentally distinct from those of a RAID system (RAID is not backup) -- only a system designed with all backup-related issues in mind is a backup system.
    • The real MTTDL will vary significantly with how you use ZFS, and with the design/quality of the hardware you use it on.
    • In critical high-uptime servers you are always looking to avoid single points of failure. Most filesystems are a single point of failure, and since ZFS itself doesn't cluster, it is no different. You may like it as the filesystem underlying a distributed filesystem, though.
    • healing ZFS is still not backup
  • ZFS cannot guarantee anything about the OS or hardware it is being run on -- its checking and healing are only about the disks. If e.g. data corrupts before it's handed to ZFS, ZFS will dutifully store that. So ZFS is only part of the puzzle. Corruption can, as with any filesystem, still happen in cases of:
    • Bit flips in RAM. A few errors are still recoverable, but ECC is recommended if you really like your data.
    • Badly seated RAM (tends to make for large errors)
    • bad IOs (malicious or mistaken) not isolated via IOMMU (e.g. because the IOMMU is disabled)
    • bugs in the kernel, in ZFS, in disk firmware, in controller firmware/drivers
    • disk systems that ignore flushes (e.g. that acknowledge the write command but postpone the actual write), and/or reorder IO operations around flushes. One example is RAID with write-cache that is not as protected as you thought, another is hard drives that increase speed by acknowledging things are on disk before they actually write it.
    • See things like https://clusterhq.com/blog/file-systems-data-loss-zfs/
    • not enough redundancy for a given issue. Consider that a misdirected write can work out as two mistakes in one.




Some history - projects, OSes, and support:

  • ZFS on Solaris is the original (and initially the only version), v1 through v28
  • There are four specific-OS forks
    • FreeBSD (was nice and stable well before it was present on Linux)
    • ZFS on Linux (replacing the earlier, slower, FUSE-based version)
    • OSX Server
    • illumos (illumos being a community fork of OpenSolaris that is fully, not mostly open)
  • After Oracle bought Sun (in 2010), ZFS v29 onwards is closed source, which essentially broke with the open source projects
  • This initially made said efforts a bit uncoordinated, until they chose to tie their efforts more closely via the OpenZFS project. The idea there is to share all platform-independent changes, with each project making them work on their own OS, so as to stay functionally equivalent.


Time will tell whether you're better off with Oracle's 'official' ZFS on Solaris or with the OpenZFS variant of your preference.

Practicalities on pool creation

ZFS terminology

  • filesystems - what ZFS calls filesystems is basically a folder within a pool that has a unique set of enabled features, so you can e.g.
have quota in an 'allowed size of stuff in filesystem' way (group and user quota also exist)
have compression of only your research document collection
have 3-way duplication of only your PhD thesis data
have deduplication of only your users' home directories
...etc. (see the sketch after this list)
see also the term dataset (a more abstract term encompassing filesystems, snapshots, clones, and more)
  • snapshots, copy-on-write style
  • clones, a similar idea to snapshots, potentially useful for VMs and such
  • zvol - basically a ZFS-backed block device
(with options of snapshotting and compression and such, like with filesystems)(verify)
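A sketch of that per-filesystem flexibility (pool and filesystem names are hypothetical):

# each ZFS filesystem carries its own properties
zfs create -o quota=100G tank/home             # 'allowed size of stuff' quota
zfs create -o compression=lz4 tank/documents   # compress only this subtree
zfs create -o copies=3 tank/thesis             # 3-way duplication (ditto blocks)
zfs set dedup=on tank/home                     # dedup only here (see the dedup notes below)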


In terms of major organization:

  • Physical disks go in vdevs
  • pools are backed by a set of vdevs
  • pools store your files and have overall failure/speed characteristics according to their vdev layout
and pools optionally subdivide into what ZFS calls filesystems (which are functional subdivisions more than physical ones)


On expansion, on balancing

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Since ZFS also implements the LVM side of things, it can easily expand to more drives.

But

  • expanding isn't always practically sensible.
  • you can never shrink


There are two ways to increase the space in a pool:

  • add a vdev to the pool
Keep in mind that you can never remove a vdev from a pool. That means you can accidentally make nonsensical configurations (e.g. weakening safety), which you can't fix with anything less than creating a new pool and moving all your data there.
  • replace all disks within a vdev with larger ones, resilvering in between each replacement (verify)
Slow and tedious, but it works.


On balancing data between vdevs:

  • It balances new data between top-level vdevs (RAID0-style)
so has bearing on speed
but very little bearing on redundancy/safety, in that the risk lies in the failure of a top-level vdev.
  • Rebalancing:
You cannot explicitly ask ZFS to rebalance existing data in the pool (and this feature probably won't materialize)
...though because of the copy-on-write nature of ZFS, anything that involves (re)writing data will spread according to the pool at that time. New data is spread. Data that is altered will spread.
so if you really really want to balance some files, you could make copies of them and remove the originals.
...and yeah, data that was written once before expanding and never altered after will stay on the original vdevs.
if you copy-a-lot-of-data-then-delete, you may fill the pool enough to start fragmenting it
  • ZFS will write more data to vdevs that have more space, so will balance in the long term
(and rebalance if/when data gets rewritten)
also means nice behaviour when you use disks of varying sizes

How to refer to devices

Any way you want, but keep in mind that /dev/sd? devices can change order and cause problems. It is recommended to use fixed references.


Basically: Look at the contents of /dev/disk/by-id/ and look at the drive label.

If the WWN is printed on the label, it's nice and shorter. If not, the ata-model_serial entry works just as well.

...though frankly, you may wish to buy a label printer anyway, just to tell which drive is which without taking them out :)

Old or new sector size

  • -o ashift=12 is for 4KiB-sector ("Advanced Format") drives
a bit more performance
apparently the freespace calculations need to be more pessimistic (by ~8%) though it's not actually lost space(verify)
https://github.com/zfsonlinux/zfs/issues/4599


  • -o ashift=9 is for old-style 512-byte sector drives
There are a few cases where bootloaders may behave better with 512
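For example, creating a mirror on 4KiB-sector drives might look like this (device ids hypothetical):

zpool create -o ashift=12 poolname mirror /dev/disk/by-id/ata-MODEL_SERIAL1 /dev/disk/by-id/ata-MODEL_SERIAL2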

Effects of sector size, stripe size, vdev size

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/


does not contain an EFI label but it may contain partition

It's being careful about a drive that might possibly contain something.


If the drive contains nothing you want to keep, and you are sure you want to wipe it, use -f (force).

Actually creating (various layouts)

Some words on layouts
  • A single vdev can be:
    • a lone disk (multiple lone-disk vdevs at top level behave RAID0-like)
    • n-way mirror (result is like RAID1 if one, RAID10 if multiple)
    • raid-z1 set (like RAID5)
    • raid-z2 set (like RAID6)
    • raid-z3 set
  • You cannot add members to existing raidz later.
  • You can add members to an existing mirror later (see attach/detach)
also applies to lone disks, which is one way to replace a disk (make it a mirror, wait for the resilver, detach the old disk - sketched below)
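A sketch of that replacement trick (pool and device names hypothetical):

zpool attach poolname existingdisk newdisk   # existingdisk becomes half of a mirror
zpool status poolname                        # wait until the resilver has finished
zpool detach poolname existingdisk           # then optionally retire the old disk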


  • ZFS always stripes across a pool's top-level vdevs.
so analogues to most classical RAID variants exist as
RAID0: all disks at top level
RAID1: add a single mirror at top level
RAID10: add multiple mirrors at top level.
RAID5/RAID6: add a single raidz1/raidz2 at top level
RAID50/RAID60: Add multiple raidz1, raidz2 at top level
  • ZFS allows any mix of vdev types at top level
including stupid ones. You can e.g. make a top-level stripe between (a 20-disk raidz3 set) and a single disk
since you can't remove vdevs, such a mistake is more or less permanent, so as always, think before you sudo
so it usually makes most sense to make all top-level vdevs the same - in that the fault tolerance, redundancy, and performance are more predictable/constant


  • raidz1 is risky for the same reason RAID5 is: while one disk is out, you have no protection and are at the mercy of the bit error rate.
statistics says:
most of your data will be okay, but a random tiny part of it is not.
How much has always depended largely on disk size, because BER has stayed mostly constant (while sizes have grown)
  • if you care about space more than anything, then RAIDZ2 and RAIDZ3 are pretty nice up to a dozen disks, striping between multiple beyond that
do read up on ZFS-specific details on most efficient setups, risk calculations, etc.
  • If you care about performance being constant, more than about maximum space, then consider striping over mirrors. Yes, this is basically the old RAID10-over-RAID[56] argument:
    • in general operation, mirrors are faster than raidz, raidz is faster than raidz2 and raidz3 (verify)
    • mirrors are also faster to resilver, meaning less time spent in a fault-sensitive state. Yes, raidz2 and raidz3 avoid that fault-sensitive state but also rebuild much slower (you may not care)
  • Middle of the road is striping over smallish raidz
  • if you care about a system that is both fault tolerant and very fast, striping over three-way mirrors is probably the best choice
...though note that this only makes sense on very-high-speed networks
  • fast is relative
in a server, disks faster than your network are pointless
use of many disks scales speed of some operations



Create commands

A RAID0-like stripe

  • detects errors (through basic ZFS checksumming)
  • cannot heal errors
  • can be extended later
  • create by adding a bunch of one-drive vdevs directly to the top level of the pool:
zpool create poolname members


A RAID1-like mirror across two or more drives

  • detect errors (through basic ZFS checksumming)
  • heals errors when there is at least one copy that checks out (which is usually the case for 2-way mirrors, and almost always for ≥3-way mirrors. Keep in mind that with a disk failure it reduces to a one-less-disk mirror, which is why people do 3-way if their data is pretty important)
  • create as a single mirror at top level:
zpool create poolname mirror members


A stripe across mirrors (like RAID10),

  • detects errors (through basic ZFS checksumming)
  • heals errors (assuming there is at least one copy that checks out)
  • performs better than raid-z (much like RAID10 performs better than RAID5/RAID6)
  • note that not just any two disks can fail (losing both disks of one mirror loses the pool)
  • create by adding multiple mirror vdevs at top level, e.g:
zpool create poolname mirror member1 member2
zpool add    poolname mirror member3 member4


A RAID-z1 stripe

  • detects errors (through basic ZFS checksumming)
  • adds parity on top of checksums (uses more disk space)
    • Similar to RAID-5 in the number of missing blocks it can withstand, except that
    • ZFS guarantees a read is correct (checks and heals on the fly)
    • ZFS can deal with more than UREs (because it checks data)
    • ZFS can tell which vdev is at fault (verify) (necessary for the above)
zpool create poolname raidz member1 member2 member3 member4 member5 member6 member7

raidz, a.k.a. raidz1


raidz2

  • like above, can deal with two damaged blocks (similar to RAID-6)
  • (Everyday?(verify)) performance between Z1, Z2, and Z3 is apparently quite similar.


raidz3

  • like above, can deal with three damaged blocks. Yet more peace of mind. No analogous classical-RAID level.
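The create commands follow the same pattern as raidz1, e.g.:

zpool create poolname raidz2 member1 member2 member3 member4 member5 member6
zpool create poolname raidz3 member1 member2 member3 member4 member5 member6 member7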


See also:

After pool creation

Sometimes it's useful to remember the admin commands you used earlier for this pool, e.g. when trying to remember what ashift parameter you used.

zpool history [poolname]


Mounting

By default, ZFS mounts the pool at creation time, at /poolname


If you want it elsewhere, you'll want to know about

zfs set mountpoint=/mnt/here poolname

This (as with any filesystem) is harder once programs have files open on it, but generally not a bother after creation. You can also do it at zpool create time, by adding -o mountpoint=/mnt/here

There are more details, of course. See things like:



There are reasons a pool can be happy, but not mounted.

I should figure out which, but it seems if setting the mountpoint didn't mount, then you can usually fix it with:

zfs mount poolname


In some cases it is useful that there is a canmount property, which can e.g. prevent automounting per dataset (values are on, off, noauto)



If you need e.g. boot to fail hard on a missing ZFS pool, consider

zfs set mountpoint=legacy poolname/fsname

Then you can use an fstab entry like:

poolname/fsname  /export/media  zfs  defaults  0  0
performance tweaking

If you don't care about atime, then you may want to do:

zfs set atime=off poolname


Other performance hints: http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance

Practicalities later

Scrubbing, monitoring

The reason to scrub ZFS is different from scrubbing classic RAID.


In RAID that doesn't check read correctness every time, regular scrubs are necessary to even notice that you have been passing through incorrect data (details depend on things like your backup strategy).


Since ZFS always checks reads, that specific reason for verifying isn't there, nor is the implication that you'd prefer it on a very short term.


But the need to worry is not gone either.

Consider that if you e.g. write once and then don't read for the next decade, you will not notice deterioration.

While ZFS makes it much likelier that longer-term bit rot is correctable, data can still die. In some cases that can be avoided with scrubs; in other cases it cannot, but you still want to know about it - again, within a time that probably relates to your backup strategy.


Since there is much less hurry, ZFS scrubs are low-priority, and will yield to regular disk use. This is why scrubs can take weeks.

Which is sometimes preferable. You can tweak it to be more aggressive if you care.

(Note that ZFS also chooses to only check files, i.e. your data rather than your disks)



Monitoring commands


Pool summary:

zpool list


Health summary (will also mention things like an ongoing scrub):

zpool status [poolname]

-v gives you more details about any errors


See current performance of pools with zpool iostat. Below, 2 is the interval in seconds:

zpool iostat [poolname] 2

-v gives you per-drive performance


Doing a scrub:

zpool scrub poolname

You should probably cronjob that.
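For example, with a crontab entry like this (schedule and pool name are just examples):

# /etc/crontab: scrub every Sunday at 03:00
0 3 * * 0   root   /sbin/zpool scrub poolname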


Dealing with errors and failure

Tweaking

For databases

Expectations

sync writes versus non-sync writes

Expectable performance

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Keep in mind that ZFS does write throttling - see section below


IOPS angle:

IOPS increases with the amount of striping, so if you want more IOPS, the best way is to add lots of smallish vdevs at top level - basically RAID10 style.
(...in part because...) vdevs themselves are more or less IOPS-bound by their slowest disk (including RAIDZ, even though they do stripe).


Throughput angle:

  • more disks in striping. You can hit ~1GB/s for sequential use with under a dozen disks (and that may be your bus limit)


Relative performance of various vdevs
  • striping along a pool's top-level vdevs (which pools always do)
    • The best way to scale aggregate IOPS and bandwidth.
    • ...aggregate in that each program will probably not see speeds much faster than a single vdev, but it is likelier that each will go to a smaller part of the whole and interrupt each other less.


  • mirror vdevs
    • write IOPS: that of a single disk (specifically the slowest, because for atomicity a write is only done once the data and metadata is all written)
    • write bandwidth: that of a single disk
    • read IOPS: scales decently because they are round-robined (details differ between relatively sequential/random patterns)
    • bandwidth: scales decently, for the same reason
    • random performance: better (like in basic RAID10 versus parity RAID)


  • raidz vdevs (A filesystem block goes to most/all of the vdev's drives)
    • write IOPS: single disk's
    • write bandwidth: aggregate (since it's basically striping to most disks)
      • can beat mirror on sequential write for largeish arrays, because of the amount of striping (but these returns do diminish)
      • but for small writes there is little difference as seek time dominates write time
    • reads: sequential good, random bad
      • reading does not really get a round robin bonus unless you're lucky (sequential reading of previously sequentially written data?)
      • random reading is slowish (as even small reads must come from all disks)
    • writes need more CPU than mirror since they calculate parity (but on dedicated storage host this should not be limiting)
    • (no write hole penalty)
    • in terms of amount of data touched, raidz avoids most of the write penalty that RAID 5/6 volumes have. But more importantly, the related data integrity issues.
    • At the cost of CPU time, but your storage should have these CPU cycles to spare (ZFS may use less CPU than similar software RAID because of content awareness(verify))


Notes:

  • the penalty on small writes present in classical RAID is lower in raidz for many workloads.
  • In vdevs, mirrors outperform raidz in the cases that matter to most people (as in, for many real-world loads)
    • Exception: for pure sequential write, and write-mostly loads, (and the same amount of striping,) there should be little difference between mirror and raidz (clever scheduler).
  • When doing benchmarks
keep in mind that a single program doing small operations may not saturate the pool capability, even if it's trying to do so
You probably want to fetch various ZFS statistics to see how and what you're loading.
  • ZFS's copy-on-write means not altering blocks in a read-modify-write style. In some (relatively ideal) cases, this turns some somewhat-random access into mostly-sequential writes. (verify)
  • Apparently, up to fourish disks, reads are similar between raidz and mirror
  • mirror is easier to expand than raidz in smaller units - you'd often just add a 2-disk mirror, while with raid-z your original vdevs probably easily involve four or five disks


http://constantin.glez.de/blog/2010/06/closer-look-zfs-vdevs-and-performance


You can often expect 200-500MB/s, though you may need a SLOG ZIL for that.


https://calomel.org/zfs_raid_speed_capacity.html


On benchmarking

Without enough RAM, ZFS is slower than most other things.

When it does have enough RAM, it does various clever things, tweaked somewhat for real-world use. Which means benchmarking for real use cases is a lot more meaningful than a dd test.

A system serving a bunch of files may be serving most of it from RAM, with disks idling - regardless of filesystem.


zpool iostat -v 2 tells you about pool and individual disk I/O.

However, it gives no indication of the IO presented to users via ZFS cache hits (gstat and iostat are in the same boat).

A script out there called arcstat.py will tell you more about cache behaviour.


Deletes are slow

They just are.

Particularly when done via the POSIX interface (in part because the POSIX API doesn't have bulk remove so sequential is the only way to do it(verify)).

But also, in any journaled filesystem it's a nontrivial thing to keep track of recoverability if you don't do it correctly.

And also, deleting is rarely as important to people as creating or altering files, which can weigh in some design choices.


Some journaled filesystems are faster, and many will seem much faster because they e.g. remove the directory entry and queue the rest of the deletes in the background, meaning the actual space will be freed soonish, but the shell command returns immediately.

(Note that you can imitate this in ZFS if you use snapshots, [1])

Notes on caches and buffers

ARC

(ARC itself is a cache strategy, basically a variant on LRU that is somewhat cleverer, e.g. won't always flush everything due to one bulk read).


L1ARC

L1ARC (we usually call it ARC) is system RAM used to cache recently read ZFS files.

This makes it similar in function to the OS's page cache, but it is ZFS-only, and not tied into the OS like the page cache is, so it was/is not quite as easily vacated when you suddenly need all your memory (this varies a little between OS implementations).


So you can consider tweaking the ARC...

  • when you have other applications that assume they can use most of RAM - the two would always be fighting, causing swapping.
so it can make sense to limit one or both.
  • when it is important that RAM allocations to other things are served in the least time, you may want to set a limit to the ARC's size
  • when all your major applications manage their data in a cache-aware way (e.g. a postgresql server), it is preferable to e.g. tell ZFS to use ARC only for metadata, not file data.
  • In some cases you may be able to know that ARC won't help your load much - this is relatively rare, but may e.g. be the case for datasets so much larger than RAM they are usually read sequentially all the way through


The thing to set is zfs_arc_max, which if left to default (0) tends towards half of physical RAM(verify).

Innard-wise that gets translated to c_max (which is the thing to look for in /proc/spl/kstat/zfs/arcstats). If you're going to tune this, monitor your ARC for a while.

Setting can be done by

  • writing a value to /sys/module/zfs/parameters/zfs_arc_max
a byte value, so for 2GB it'd be e.g. echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
it won't shrink very fast. You could clear all caches (which may make for a bunch of IO for a while, so can temporarily be a big performance dip on some types of server) with something like (sync; echo 3 > /proc/sys/vm/drop_caches)
  • apparently, a line in /etc/modprobe.d/zfs.conf (verify) - sketched below
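The modprobe.d variant would look something like this (value is an example, takes effect at module load):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=2147483648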

Note that too low a value defeats this cache (and tends to mean poor IO), too high and you get cache contention



Note that setting this on a running system does not mean ARC will shrink immediately. You can force that by allocating a lot of memory (e.g. using stress)


On errors and failure:

  • RAM failure is irrelevant, in that your system has just crashed anyway.
  • RAM errors basically come down to whether you chose to use ECC RAM. If you didn't, that was the risk you already took.


L2ARC

L2ARC is an optional device-based read cache.

This makes sense when

you have a subset of your data that you read frequently,
that data may be too large to fit in RAM (i.e. the L1ARC) without getting evicted
but fits on e.g. an SSD.


For example, consider a photo site's image servers: a read-heavy workload, many smallish reads, and a good number of requests will be for a small subset, mostly the recently popular images, and this subset will probably largely fit within L2ARC.


This can raise throughput and IOPS, assuming the pool is platter disks. Even if it doesn't raise throughput/IOPS, the variance of the two should be lower.


Using an SSD is preferred, as it has more effect on IOPS (and sometimes throughput).

You can use a platter L2ARC, and it will help a little, though mostly because you're splitting work to one more (somewhat targeted) drive. At the same time, the cache's performance limit is that of a single disk, with seeks almost as expensive as the pool's (it helps if you put L2ARC on a striped vdev, of course).

Basically, use a HDD L2ARC only when you can easily explain why it will actually help.


You could get something out of a compressed ramdisk used for L2ARC, either through ZFS or through something like zram (which confusingly is entirely unrelated to ZFS). Yet it's a bit finicky, so not always worth it.


On errors and failure: An ARC cache is basically just as safe as your system is in general, as:

regular checksumming applies, so no disk errors make it through.
if the storage backing L2ARC fails, ZFS transparently stops using L2ARC.

The last means that if L2ARC is designed to be part of why your production servers are fast, you should plan for its failure mode (note that unlike a SLOG, cache devices cannot be mirrored; losing one just means a slower pool).
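Adding L2ARC amounts to adding a cache vdev (device name hypothetical):

zpool add poolname cache /dev/disk/by-id/ata-SOME_SSD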


Limitations

Consider that L2ARC has some management overhead, on the order of a few hundred MB to a GB per TB. (verify)

This, and the fact you'll probably have a Zipf-like distribution of benefit, means you should probably rarely have a very large L2ARC. (verify)





tweaking primarycache and secondarycache properties
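These let you say, per filesystem, what may be kept in ARC (primarycache) and L2ARC (secondarycache); values are all, metadata, and none. For example, for a filesystem whose application does its own data caching (names hypothetical):

zfs set primarycache=metadata poolname/dbfiles
zfs set secondarycache=none poolname/dbfiles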

ZIL

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The ZIL (ZFS Intent Log) is basically a transaction log (similar to those in databases, if you're familiar).

Note that ZIL contents are duplicated in RAM and on disk, and ZFS uses the RAM copy. Normal operation rarely reads contents from the disk copy - it is there for correctness and recovery, and read when importing. It is not a write buffer as such, just there to have a non-volatile copy. Yet it is still important to IOPS, because it is flushed to disk regularly.


The ZIL is used for all ZFS metadata writes(verify).

The ZIL also applies to sync writes. (with some details: sync writes smaller than 64KB have their content go to the ZIL, larger writes go to the pool and are pointed to by the ZIL. That threshold is tweakable. This seems to be a sensible performance consideration that has little other impact?(verify))

Non-sync write data skips the ZIL; it is only in RAM (usually effectively a few-second buffer(verify)).

It also applies to all filesystem syscalls that apply to ZFS. (Technically, this means that all writes are mentioned in the ZIL(verify), while the data is only there for (small) sync writes(verify))

The upshot is that recovery (after a crash / power loss) loses more async writes than sync writes. ...like any filesystem. Because that is a sensible bias that you want.

And due to ZFS's copy-on-write and transactional nature, you would just see an older version of the data. The ZIL is there in part so that "accepted into the ZIL" typically means "will be seen in the pool, be it now or a bit later".



So what of my everyday use goes via the ZIL?

  • Most everyday workstation stuff is async, so doesn't.
you may notice the ZIL barely ever gets used
everyday work also rarely saturates the disk, in which case the ZIL may not be a bottleneck at all.
  • everyday data processing is also typically async
...though may occasionally saturate the disks.
  • databases transaction logs writes are sync, and should be.
  • NFS is sync by default (note this includes some virtualization uses, such as ESXi)
  • iSCSI isn't always sync -- and you may want it to be (storage managers take note)
  • anything else that calls fsync, or used O_DSYNC
some programs can be configured to use such sync writes
  • You can apparently also force ZFS to treat all writes as async (but why?) - see below
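That forcing is the per-filesystem sync property:

zfs set sync=disabled poolname/fsname   # treat everything as async (fast, riskier on power loss)
zfs set sync=always poolname/fsname     # treat everything as sync
zfs set sync=standard poolname/fsname   # the default: applications decide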


On place, size, and speed

The ZIL is by default stored within the pool that it serves.

This setup basically means that everything that goes via the ZIL is effectively written twice to the pool.

If that's platter, and you have a structurally high amount of sync writes (database server, fileserver, various virtualization), then you probably want to put the ZIL on a separate device.

Putting the ZIL on a separate device is termed having a SLOG, a "Separate Intent Log".

Using an SLOG at all tends to help, and there is already value in using another platter disk.
Still, with SSD pricing these days an SSD SLOG is usually a no-brainer.


While it is not a write buffer, correctness still relies on the data being in the ZIL before an operation is acknowledged.

As such, a faster ZIL means faster acknowledgment, higher IOPS to sync clients, and more predictable write performance. Most of those more so if you have raidz that isn't fully loaded.


For most workloads the ZIL need not be large, because the amount of not-yet-safe data is typically small.

Since a transaction group is often ~5 seconds, the rule of thumb is that 5 times your typical speed of writing is enough.
so even if all your data is sync and you average 400MB/s, that still means 2GB of your SSD is plenty (you probably still want a larger SSD to get more lifetime out of it).


When you care more about performance, and don't care that recovery doesn't always get you the latest data, then it's not insane to put your SLOG on tmpfs or similar RAMdisk. Though it often makes more sense to make sure only metadata and no data goes to ZIL.


On failure of device(s) backing the ZIL:

  • Early versions (≤v19?) could not deal well with ZIL failure, current versions can.
  • The ZIL content is duplicated in ARC (read: in RAM) until it is in the pool, so a disk that backs ZIL failing means no data loss (regardless of whether it's in-pool or SLOG)

There seem to still be roughly two failure cases:

  • SLOG fails (meaning ZIL data is only in RAM), then power fails before ZFS adjusts to put ZIL data in the pool (basically the next txg)
  • power failure, SLOG device fails on bootup
(this is an argument against a HDD as SLOG)

Both can be mitigated with a mirror SLOG.
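Adding a mirrored SLOG looks like this (device names hypothetical):

zpool add poolname log mirror /dev/disk/by-id/ata-SSD1 /dev/disk/by-id/ata-SSD2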


See also:


Tweaking recordsize and logbias

On write throttling

Notes on some features

NFS sharing

SMB sharing

Quota and reservations

On duplication to fight disk error (but not disk failure)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

It can make sense to selectively ask ZFS to store multiple copies of your precious raw data (basically for the same self-healing properties) - see the sketch after the list below.


Yes, mirror vdevs imply the same, but ditto blocks allow what amounts to mirroring at the ZFS-filesystem rather than the pool level.

tl;dr:

  • Ditto blocks refer to the fact that ZFS can store each block one, two, or three times (a low-level decision)
    • all metadata always gets stored multiple times for general robustness (2 or 3 depending on exactly what)
    • file data gets stored once, unless you request it to be stored 2 or 3 times
  • ditto blocks are separate (i.e. apply on top of) redundancy you get from mirrors/raidz
  • ditto blocks only help against disk error, not against disk failure
roughly because the pool can still be considered failed, even if there are multiple (good) copies of the file you are interested in.
Never plan on post-mortem recovery from failed pools, plan on pools not failing in the first place by using pool-level mirror/raidz
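The sketch mentioned above - requesting ditto blocks is a per-filesystem property, and only applies to data written after setting it (names hypothetical):

zfs set copies=2 poolname/precious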


http://docs.oracle.com/cd/E19253-01/819-5461/gevpg/index.html

On compression

tl;dr:

  • transparent to applications.
  • If you know essentially none of your data is compressible, you could leave it off
  • ...though it won't compress data it doesn't think is compressible, so if you think it can ever help, enable and forget about it.
The overhead is low enough to be good for all but the most latency-sensitive needs
You may wish to use LZ4 when the default is LZJB
  • biased to compress/decompress quickly rather than to compress well
decompression uses some CPU, but typically adds very little latency


  • you could separate compressible and incompressible data into different ZFS filesystems
  • If your throughput bottleneck is in the IO hardware (relevant for storage clusters, and some small, few-disk setups), compression could actually help throughput (at negligible latency cost).
Depends a bunch on the setup, though.


To test compression:

# Create a filesystem for the test
zfs create mypool/comptest
 
# enable compression on it
zfs set compression=lz4 mypool/comptest
 
# now copy on some data (compression only applies to new writes)
 
# you could look at the overall compression ratio
zfs get all | grep ratio
# On an existing filesystem, it's often more informative 
# to see apparent size versus disk usage for your test files
du                  # reports how many bytes are actually used on platter
du --apparent-size  # reports the file's size that you see

On deduplication

ZFS provides deduplication, at the level of the blocks it writes.

It creates a lookup table for each pool it is enabled for, based on the same block hashes it uses internally. When hashes/data match, only one block is actually stored.


Most people want it very selectively, or not at all, for one main reason: namely that it's memory hungry, for what is usually little gain.


The dedup table has to be in memory for that pool to accept new writes at reasonable speed. The size of the dedup table (DDT) is proportional to three things:

DDT entry size (order of 100-320 bytes, this varies(with what?(verify))),
average block size used (typically lower than 128K),
and amount of (unique) blocks.

For not-highly-deduppable data, you can ballpark it at 1GB RAM per 1TB of data (sometimes more, e.g. many small files meaning smaller block sizes). (Putting the dedup table on an SSD would be a decent in-between in terms of speed and price, and this seems to actually be a feature now (verify).)


That GB-per-TB figure becomes a large factor in your cost/benefit calculation: Unless you know your users will have mostly duplicated data, the cost of drives you can save is probably lower than the cost of RAM you need to add.


There are some specific cases where dedup makes sense, like a backup server where you copy in files as-is every so often (...and not with snapshots), because those are mostly duplicate (and the occasional backup can be a bit sluggish).

That said, you may want to consider that dedup clashes with backup somewhat, in that you don't have all the copies you might think you have.

It can also make sense enabled only on specific filesystems in a pool.


Estimate whether it's possibly worth it
See how much it's saving
Finding which files have deduplicated blocks
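A sketch for the first two of those (pool name hypothetical):

# simulate dedup on the pool's existing data; prints a block histogram
# and an estimated dedup ratio (walks a lot of metadata, so can take a while)
zdb -S poolname

# on a pool where dedup is enabled, the current ratio shows in the DEDUP column of:
zpool list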

Block Pointer Rewrite

Seems to be the name for "rewriting the data block as it is", which people want so that:

pools with added disks get rebalanced
settings like dedup and compression get applied to files present before they were enabled


While there is no reason this code could not exist, there is decent reason it does not, and probably never will (until maybe the day we stop adding to ZFS): given ZFS's featureset (and its write-once / copy-on-write nature, also basically the reason behind the 'no vdev removal' thing[2]), it's involved enough that it may never happen, slightly conflicts with existing features (consider snapshots), and would make new features harder to add.


Which means that for said rebalancing/compression/dedup, you're better off somehow explicitly doing that move/rewrite yourself.

Other notes

Issues and errors

vmap allocation for size 2101248 failed: use vmalloc=size to increase size

...or for other sizes. Probably the spl_kmem_cache process is going crazy as well.


You're likely on 32-bit linux, which by default only has ~100MB virtual kernel memory.

This leaves too little for ZFS to use, which means ZFS performance nosedives on most operations.
For the 32-bit system I had this issue on:
# cat /proc/meminfo | grep Vmalloc
VmallocTotal:     122880 kB
VmallocUsed:      110352 kB
VmallocChunk:       1568 kB


You probably want to be running 64-bit anyway. There is typically no clean way to switch an existing system from 32-bit to 64-bit, so you'll have to reinstall.

If you don't want to switch (yet), you can steal more memory for the kernel with kernel argument like vmalloc=, see e.g. [3]
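On a GRUB-based system that would look something like this (value is an example):

# in /etc/default/grub:
GRUB_CMDLINE_LINUX="vmalloc=512M"
# then regenerate the grub config, e.g. on Debian/Ubuntu:
update-grub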

txg_sync taking a lot of CPU when under IO load

Possible influences:

  • If the pool is very nearly full, ZFS may be looking hard for free space to satisfy an operation
  • memory pressure forced by programs using more than they should
ZFS has historically had some problems where it would think there is memory pressure when there wasn't really
  • IO subsystem is bottlenecked(verify)
I had an issue using a RAID controller (configured as JBOD, and ZFS on top), where the controller's driver would stutter and pause IO entirely when under high load
...which, by the way, also triggered INFO: task blocked for more than 120 seconds.
which got better when I used noop rather than default cfq scheduler
  • bulk delete of many files doing its thing
  • worse when doing a lot of sync writes when you don't have a SLOG (because it means more IO)


https://github.com/zfsonlinux/zfs/issues/938

Filesystems mounted read-only

Additional devices are known to be part of this pool, though their exact configuration cannot be determined

...when trying to import.

You are likely to get more information when you specify a pool by its numeric id (as listed by a plain zpool import)

sudo zpool import 4109999925840410040


In my case it mentioned a missing log device -- a ramdisk which I hadn't made persistent. It was an experiment anyway, so I forced it to mount without this log device (import -m) and removed the missing log device from the configuration (zpool remove tank itsnumericidfromstatus)

Unsorted

On ZFS pool versions (historical)