Computer data storage - ZFS notes
These are primarily notes.
It won't be complete in any sense.
It exists to contain fragments of useful information.
- 1 Why ZFS may be interesting for you
- 2 Practicalities on pool creation
- 2.1 ZFS terminology
- 2.2 On expansion, on balancing
- 2.3 How to refer to devices
- 2.4 Old or new sector size
- 2.5 Effects of sector size, stripe size, vdev size
- 2.6 does not contain an EFI label but it may contain partition
- 2.7 Actually creating (various layouts)
- 2.8 After pool creation
- 3 Practicalities later
- 4 Expectations
- 5 Notes on some features
- 5.1 NFS sharing
- 5.2 SMB sharing
- 5.3 Quota and reservations
- 5.4 On duplication to fight disk error (but not disk failure)
- 5.5 On compression
- 5.6 On deduplication
- 5.7 Block Pointer Rewrite
- 6 Other notes
- 6.1 Issues and errors
- 6.2 Unsorted
- 6.3 On ZFS pool versions (historical)
Why ZFS may be interesting for you
- ZFS chooses safety over speed
- scaling to more than a handful of disks is safer to do
- which is a good idea at today's scales, even with pricy disks but especially with cheaper disks.
- (it relates to the whole "Bit Error Rate has stayed constant while arrays are getting much bigger" thing)
- speed is often actually still quite decent (just because many platters/channels)
- ...it chooses to distrust disks
- It verifies checksums on all reads
- if it sees a mismatch it can fix (as in, enough verifiably good data from duplication/parity), it fixes it before returning the data
- if it sees a mismatch it cannot fix, it tells you by failing the read
- ...note that none of the above three is necessarily true on any given classical RAID
- You should still scrub ZFS (some notes below)
- note that this does not deal with all possible sources of corruption, but more than most
- As fixing is done on demand, ZFS has no fsck. (A zpool scrub is close enough)
- multiple copies of metadata are kept
- so filesystem integrity is likely to survive sector errors/failures (more so than file data)
- transactional writes - journaling, but a little more thorough than some implementations.
- More protection at the cost of some performance
- can do stripe/mirror/parity stuff
- avoid some unnecessary parts of the read-calculate-write cycle that RAID5 and RAID6 do by being file-based.
- no write hole issues since it's COW (writes new block, then retires old one)
- causes slowdown when almost full
- yet generally a little faster than simpler software RAID
- can do transparent compression
- can do LVM, quota, snapshots
- replication - as in saving snapshots (and snapshot differences) to a file, and restore from them. And move them between pools and hosts.
- basic mechanism is that zfs send feeds data to zfs receive
- A common tutorial example pipes this through SSH to avoid an intermediate file; if you had SSH set up anyway, it makes host-to-host replication a one-liner without a large temporary file (see the example after this list)
- not so ideal at very large scales, though
- deduplication at block level.
- however, you probably don't want to use it (some discussion somewhere below)
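The replication mentioned above looks roughly like this (a sketch: the pool, filesystem, snapshot, and host names are made up, and the receiving filesystem is assumed to accept the stream):
zfs snapshot poolname/fsname@monday
zfs send poolname/fsname@monday | ssh otherhost zfs receive otherpool/fsname
# later, send only what changed between two snapshots (incremental)
zfs send -i poolname/fsname@monday poolname/fsname@tuesday | ssh otherhost zfs receive otherpool/fsname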
The heal-or-fail behaviour is probably the deal-maker for many, and the transparent compression a nice bonus.
Some argue that combining LVM, RAID, error checking, quota, compression, etc. into one monolithic thing breaks with the *nix "each program does one thing well" philosophy.
Others argue "not so much, when these are all aspects of the same thing (storage), and the parts integrate with each other better than they would as independent pieces".
This probably should stay a lively discussion, because both have good points.
Some further concepts it builds with:
- filesystems - what ZFS calls filesystems is basically a folder within a pool that has a unique set of enabled features, so you can e.g. (see the example after this list)
- have quota in an 'allowed size of stuff in filesystem' way (group and user quota also exist)
- have compression of only your research document collection
- have 3-way duplication of only your PhD thesis data
- have deduplication of only your users' home directories
- see also the term dataset (a more abstract term encompassing filesystems, snapshots, clones, and more)
- snapshots, copy-on-write style
- clones, a similar idea to snapshots, potentially useful for VMs and such
- zvol - basically a ZFS-backed block device
- (with options of snapshotting and compression and such, like with filesystems)(verify)
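For example, the per-filesystem features mentioned above look roughly like this (a sketch; the pool/filesystem names and values are made up):
zfs create poolname/home
zfs create poolname/thesis
zfs set compression=lz4 poolname/home    # compress only this filesystem
zfs set copies=3 poolname/thesis         # keep three copies of these blocks
zfs set quota=50G poolname/home          # cap how much this filesystem may use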
Limitations / food for thought:
- ZFS is copy-on-write by nature.
- It makes various things safer
- It makes various things a little slower
- It even makes a few things faster
- ZFS can be fast, but this is secondary to safety whenever it is a choice
- e.g. read-heavy workloads are easier than write-heavy ones
- some potentially clever read and write caching (with a bunch of details you may want to know)
- The ZIL is interesting to read up on when you want performance via a hybrid SSD+HDD setup
- ZFS loves memory. Enough that you may wish to buy some more.
- ...in part because when you use ZFS, you're likely to be setting up a larger-than-your-average workstation storage.
- How much? TODO:(verify)
- For people wishing to run it on their old hardware: With tweaking it can be made to run, but know what you're doing. Things likely will stutter more.
- For decent responsiveness on a pool, it'll quickly use 1GB, be more responsive with 4GB, preferably 6GB (partly people-repeating-each-other figures, but e.g. below 4GB of RAM it disables vdev prefetch)
- For large pools, add some RAM scaling with the pool size - 1GB of RAM per 1TB of actively used data is nice
- (If you use dedup, you'll probably need at least an additional 3GB of RAM per 1TB of data)
- you can't shrink storage, and you can't remove vdevs from a pool
- ...the latter means certain admin mistakes are bad, so think before you sudo
- It's software RAID in the end. For CPU-constrained hosts, this will mean speeds below raw disk speeds.
- For servers this can be negligible, but it can be an issue.
- the snapshot thing seems designed for archival backup, not restore speed. Fetching back files can be involved and slow.
- You probably don't want it to back iSCSI, for speed reasons. At least, not without a lot of reading and tweaking.
- VMs are not as bad, but still worth reading up on.
- it fragments more easily
- Particularly on full pools. Avoid filling above 80% when you care about speed (true for most filesystems)
- Some workloads, e.g. random small alterations, will fragment even if the pool is not full
- If you primarily work with databases, search indexes, and other things that pretty much implement their own filesystem in the first place, then ZFS is not necessarily the thing you can squeeze the most performance out of.
- However, there are some tweaks and tricks you can apply - you can get good performance but still get ZFS's data integrity. If that combination sounds good, it's worth checking out (read up on ARC and ZIL if you do, and read that evil speed tweaking guide).
- If you are considering ZFS for a very important server, consider:
- Its MTTDL is better than that of a lot of similar hardware/software RAID, and more controllable, but the requirements of a backup system are fundamentally distinct from those of a RAID system -- only a system designed with all backup-related issues in mind is a backup system.
- The real MTTDL will vary significantly with how you use ZFS, and with the design/quality of the hardware you use it on.
- In critical high-uptime servers you are always looking to avoid single points of failure. Most filesystems are a single point of failure, and since ZFS itself doesn't cluster, it is no different. You may like it as the filesystem underlying a distributed filesystem, though.
- healing ZFS is still not backup
- ZFS cannot guarantee anything about the OS or hardware it runs on -- its checking and healing only covers the disks. If e.g. data is corrupted before it is handed to ZFS, ZFS will dutifully store that corrupted data. So ZFS is only part of the puzzle. Corruption can, as with any filesystem, still happen in cases of:
- Bit flips in RAM. A few errors are still recoverable, but ECC is recommended if you really like your data.
- Badly seated RAM (tends to make for large errors)
- bad IOs (malicious or mistaken) not isolated via IOMMU (e.g. because the IOMMU is disabled)
- bugs in the kernel, in ZFS, in disk firmware, in controller firmware/drivers
- disk systems that ignore flushes (e.g. that acknowledge the write command but postpone the actual write), and/or reorder IO operations around flushes. One example is RAID with write-cache that is not as protected as you thought, another is hard drives that increase speed by acknowledging things are on disk before they actually write it.
- See things like https://clusterhq.com/blog/file-systems-data-loss-zfs/
- not enough redundancy for a given issue. Consider that a misdirected write can work out as two mistakes in one.
Some history - projects, OSes, and support:
- ZFS on Solaris is the original (and initially the only version), v1 through v28
- There are four specific-OS forks
- FreeBSD (was nice and stable well before it was present on Linux)
- ZFS on Linux (replacing the earlier, slower, FUSE-based version)
- OSX Server
- illumos (illumos being a community fork of OpenSolaris that is fully, not mostly open)
- After oracle bought Sun (in 2010), ZFS v29 onwards is closed source and essentially broke with the open source projects
- This initially made said efforts a bit uncoordinated, until they chose to tie their efforts more closely via the OpenZFS project. The idea there is to share all platform-independent changes and each makes them work on their own OS, so to be functionally equivalent.
Time will tell whether you're better off with Oracle's 'official' ZFS on Solaris or with the OpenZFS variant of your preference.
Practicalities on pool creation
In terms of major organization:
- Physical disks go in vdevs
- pools are backed by a set of vdevs
- pools store your files and have overall failure/speed characteristics according to their vdev layout
- and pools optionally subdivide into what ZFS calls filesystems (which are mostly a functional subdivision)
On expansion, on balancing
Since ZFS also implements the LVM side of things, it can easily expand to more drives.
You can't shrink, though. Also expanding isn't always practically sensible.
There are two ways to increase the space in a pool:
- add a vdev to the pool
- Keep in mind that you can never remove a vdev from a pool. That means you can accidentally make nonsensical configurations (e.g. weakening safety), which you can't fix with anything less than creating a new pool and moving all your data there.
- replace all disks within a vdev with larger ones, resilvering in between each replacement (verify)
- Slow and tedious, but it works.
On balancing data between vdevs:
- In terms of data placement, it balances new data between top-level vdevs (RAID0-style)
- so has bearing on speed
- but very little bearing on redundancy/safety, in that the risk lies in the failure of a top-level vdev.
- You cannot explicitly ask ZFS to rebalance existing data in the pool (and this feature probably won't materialize)
- ...though because of the copy-on-write nature of ZFS, anything that involves (re)writing data will spread according to the pool's layout at that time. New data is spread; data that is altered gets spread.
- so if you really want to rebalance some files, you could copy them and remove the originals (see the sketch after this list)
- ...and yeah, data that was written once before expanding and never altered afterwards will stay on the original vdevs.
- if you copy-a-lot-of-data-then-delete, you may fill the pool enough to start fragmenting it
- ZFS will write more data to vdevs that have more space, so will balance on the long term
- (and rebalance if/when data gets rewritten)
- also means nice behaviour when you use disks of varying sizes
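A sketch of both expansion styles, and the copy-to-rebalance trick mentioned above (pool, device, and path names are made up):
# add another mirror vdev to the pool (cannot be undone)
zpool add poolname mirror /dev/disk/by-id/ata-newdisk1 /dev/disk/by-id/ata-newdisk2
# or grow an existing vdev by replacing its disks one at a time, letting it resilver in between
zpool replace poolname ata-olddisk1 ata-biggerdisk1
# rewrite specific data so it spreads over the current set of vdevs
cp -a /poolname/olddata /poolname/olddata.rebalanced && rm -rf /poolname/olddata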
How to refer to devices
Any way you want, but keep in mind that /dev/sd? devices can change order and cause problems, so it is recommended to use fixed references.
Basically: look at the contents of /dev/disk/by-id/ and look at the drive's label.
If the WWN is printed on the label, that's nice and short. If not, the ata-model_serial entry works just as well.
...though frankly, you may wish to buy a label printer anyway, just to tell which drive is which without taking them out :)
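For example (a sketch; the id names shown are made up):
ls -l /dev/disk/by-id/
zpool create poolname mirror /dev/disk/by-id/ata-WDC_WD40EFRX_WD-WX11111111 /dev/disk/by-id/ata-WDC_WD40EFRX_WD-WX22222222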
Old or new sector size
- ashift=12 is for new-style ("Advanced Format") 4KB-sector drives (see the example after this list)
- a bit more performance
- apparently the free-space calculations need to be more pessimistic (by ~8%), though it's not actually lost space(verify)
- ashift=9 is for old-style 512-byte sector drives
- There are a few cases where bootloaders may behave better with 512
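For example, forcing the sector-size assumption at creation time (a sketch; ashift is a power of two, so 9 means 512-byte and 12 means 4KiB sectors):
zpool create -o ashift=12 poolname mirror member1 member2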
Effects of sector size, stripe size, vdev size
does not contain an EFI label but it may contain partition
It's being careful about a drive that might possibly contain something. If you know you want to wipe the drives, add -f, meaning force.
Actually creating (various layouts)
Some words on layouts
- A single vdev can be:
- a lone disk (RAID0-like if multiple)
- n-way mirror (result is like RAID1 if one, RAID10 if multiple)
- raid-z1 set (like RAID5)
- raid-z2 set (like RAID6)
- raid-z3 set
- You cannot add members to existing raidz later.
- You can add members to an existing mirror later (see attach/detach)
- also applies to lone disks, which is one way to replace it (make it a mirror, wait to sync, detach the old disk)
- ZFS always stripes across a pool's top-level vdevs.
- so analogues to most classical RAID variants exist as
- RAID0: all disks at top level
- RAID1: add a single mirror at top level
- RAID10: add multiple mirrors at top level.
- RAID5/RAID6: add a single raidz1/raidz2 at top level
- RAID50/RAID60: Add multiple raidz1, raidz2 at top level
- ZFS allows any mix of vdev types at top level
- including stupid ones. You can e.g. make a top-level stripe between (a 20-disk raidz3 set) and a single disk
- since you can't remove vdevs, such a mistake is more or less permanent, so as always, think before you sudo
- so it usually makes most sense to make all top-level vdevs the same - in that the fault tolerance, redundancy, and performance are more predictable/constant
- raidz1 is risky for the same reason RAID5 is: while one disk is out, you have no protection and are at the mercy of the bit error rate.
- statistics says:
- most of your data will be okay, but a random tiny part of it is not.
- How much has always depended largely on disk size, because BER has stayed mostly constant (while sizes have grown)
- if you care about space more than anything, then RAIDZ2 and RAIDZ3 are pretty nice up to a dozen disks, striping between multiple beyond that
- do read up on ZFS-specific details on most efficient setups, risk calculations, etc.
- If you care about performance being constant, more than about maximum space, then consider striping over mirrors. Yes, this is basically the old RAID10-over-RAID argument:
- in general operation, mirrors are faster than raidz, raidz is faster than raidz2 and raidz3 (verify)
- mirrors are also faster to resilver, meaning less time spent in a fault-sensitive state. Yes, raidz2 and raidz3 avoid that fault-sensitive state but also rebuild much slower (you may not care)
- Middle of the road is striping over smallish raidz
- if you care about a system that is both fault-tolerant and very fast, striping over three-way mirrors is probably the best choice
- ...though note that this only makes sense on very-high-speed networks
- fast is relative
- in a server, disks faster than your network are pointless
- use of many disks scales speed of some operations
A RAID0-like stripe
- detects errors (through basic ZFS checksumming)
- cannot heal errors
- can be extended later
- create by adding a bunch of one-drive vdevs directly to the top level of the pool:
zpool create poolname members
A RAID1-like mirror across two or more drives
- detect errors (through basic ZFS checksumming)
- heals errors when there is at least one copy that checks out (which is usually the case for 2-way mirrors, and almost always for ≥3-way mirrors. Keep in mind that with a disk failure it reduces to a one-less-disk mirror, which is why people do 3-way if their data is pretty important)
- create as a single mirror at top level:
zpool create poolname mirror members
A stripe across mirrors (like RAID10),
- detects errors (through basic ZFS checksumming)
- heals errors (assuming there is at least one copy that checks out)
- performs better than raid-z (much like RAID10 performs better than RAID5/RAID6)
- note that not any two disks can fail: losing both members of the same mirror loses the pool
- create by adding multiple mirror vdevs at top level, e.g:
zpool create poolname mirror member1 member2
zpool add poolname mirror member3 member4
A RAID-z1 stripe
- detects errors (through basic ZFS checksumming)
- adds parity on top of checksums (uses more disk space)
- Similar to RAID-5 in amount of missing blocks it can stand, except that
- ZFS guarantees a read is correct (checks and heals on the fly)
- ZFS can deal with more than UREs (because it checks data)
- ZFS can tell which vdev is at fault (verify) (necessary for the above)
zpool create poolname raidz member1 member2 member3 member4 member5 member6 member7
(the raidz keyword above is a.k.a. raidz1)
A RAID-z2 stripe
- like above, but can deal with two damaged blocks (similar to RAID-6)
- (Everyday?(verify)) performance between Z1, Z2, and Z3 is apparently quite similar.
A RAID-z3 stripe
- like above, but can deal with three damaged blocks. Yet more peace of mind. No analogous classical-RAID level.
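Creation follows the same pattern as raidz, just with a different keyword, e.g.:
zpool create poolname raidz2 member1 member2 member3 member4 member5 member6
zpool create poolname raidz3 member1 member2 member3 member4 member5 member6 member7 member8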
After pool creation
Sometimes it's useful to remember the admin commands you used earlier for this pool, e.g. when trying to remember what ashift parameter you used.
zpool history [poolname]
By default, ZFS mounts the pool at creation time, at /poolname
If you want it elsewhere, you'll want to know about
zfs set mountpoint=/mnt/here poolname
This is harder once programs have files open on it, but generally not a bother after creation. You can also do it at zpool create time, by adding -o mountpoint=/mnt/here
There are more details, of course; see e.g. the zpool and zfs man pages.
There are reasons a pool can be happy, but not mounted.
I should figure out which, but it seems if setting the mountpoint didn't mount, then you can usually fix it with:
zfs mount poolname
In some cases it is useful that there is a canmount property, which can e.g. prevent automounting per dataset (values are on, off, noauto)
If you need e.g. boot to fail hard on a missing ZFS pool, consider
zfs set mountpoint=legacy poolname/fsname
Then you can use an fstab entry like:
poolname/fsname /export/media zfs defaults 0 0
If you don't care about atime, then you may want to do:
zfs set atime=off poolname
The reason to scrub ZFS is different from scrubbing classic RAID.
In RAID that doesn't check read correctness every time, regular scrubs are necessary to even notice that you have been passing through incorrect data (details depend on things like your backup strategy).
Since ZFS always checks reads, that specific reason for verifying isn't there, nor is the implication that you need to do it on a very short term.
But the need to worry is not gone either.
Consider that if you e.g. write once and then don't read for the next decade, you will not notice deterioration.
While ZFS makes it much likelier that longer-term bit rot is correctable, data can still die. In some cases that can be avoided with scrubs, in other cases it cannot but you still want to know about it - again, probably within a time that probably relates to your backup strategy.
Since there is much less hurry, ZFS scrubs are low-priority, and will yield to regular disk use. This is why scrubs can take weeks.
Which is sometimes preferable. You can tweak it to be more aggressive if you care.
(Note that ZFS also chooses to only check files, i.e. your data rather than your disks)
Health summary (will also mention things like an ongoing scrub):
zpool status [poolname]
-v gives you more details about any errors
See current performance of pools with zpool iostat. Below, 2 is the interval in seconds:
zpool iostat [poolname] 2
-v gives you per-drive performance
Doing a scrub:
zpool scrub poolname
You should probably cronjob that.
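For example, a line in /etc/crontab for a monthly scrub (a sketch; the pool name, timing, and zpool path are placeholders):
# at 03:00 on the first of every month
0 3 1 * * root /sbin/zpool scrub poolname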
Dealing with errors and failure
sync writes versus non-sync writes
Keep in mind that ZFS does write throttling - see the section below
- IOPS increases with the amount of striping, so if you want more IOPS, the best way is to add lots of smallish vdevs at top level - basically RAID10 style.
- (...in part because...) vdevs themselves are more or less IOPS-bound by their slowest disk (including RAIDZ, even though they do stripe).
- more disks in striping. You can hit ~1GBy/s for sequential use with under a dozen disks. (and that may be your bus limit)
- Relative performance of various vdevs
- striping along a pool's top-level vdevs (which pools always do)
- The best way to scale aggregate IOPS and bandwidth.
- ...aggregate in that each program will probably not see speeds much faster than a single vdev, but it is likelier that each will go to a smaller part of the whole and interrupt each other less.
- mirror vdevs
- write IOPS: that of a single disk (specifically the slowest, because for atomicity a write is only done once the data and metadata is all written)
- write bandwidth: that of a single disk
- read IOPS: scales decently because they are round-robined (details differ between relatively sequential/random patterns)
- bandwidth: scales decently, for the same reason
- random performance: better (like in basic RAID10 versus parity RAID)
- raidz vdevs (A filesystem block goes to most/all of the vdev's drives)
- write IOPS: single disk's
- write bandwidth: aggregate (since it's basically striping to most disks)
- can beat mirror on sequential write for largeish arrays, because of the amount of striping (but these returns do diminish)
- but for small writes there is little difference as seek time dominates write time
- reads: sequential good, random bad
- reading does not really get a round robin bonus unless you're lucky (sequential reading of previously sequentially written data?)
- random reading is slowish (as even small reads must come from all disks)
- writes need more CPU than mirror since they calculate parity (but on dedicated storage host this should not be limiting)
- (no write hole penalty)
- in terms of amount of data touched, raidz avoids most of the write penalty that RAID 5/6 volumes have. But more importantly, the related data integrity issues.
- At the cost of CPU time, but your storage should have these CPU cycles to spare (ZFS may use less CPU than similar software RAID because of content awareness(verify))
- the penalty on small writes present in classical RAID is lower in raidz for many workloads.
- In vdevs, mirrors outperform raidz in the cases that matter to most people (as in, for many real-world loads)
- Exception: for pure sequential write, and write-mostly loads, (and the same amount of striping,) there should be little difference between mirror and raidz (clever scheduler).
- When doing benchmarks
- keep in mind that a single program doing small operations may not saturate the pool capability, even if it's trying to do so
- You probably want to fetch various ZFS statistics to see how and what you're loading.
- ZFS's copy-on-write means not altering blocks in a read-modify-write style. In some (relatively ideal) cases, this turns some somewhat-random access into mostly-sequential writes. (verify)
- Apparently, up to fourish disks, reads are similar between raidz and mirror
- mirror is easier to expand than raidz in smaller units - you'd often just add a 2-disk mirror, while with raid-z your original vdevs probably involve four or five disks
You can often expect 200-500MB/s, though you may need a SLOG ZIL for that.
Without enough RAM, ZFS is slower than most other things.
When it does have enough RAM, it does various clever things, tweaked somewhat for real-world use. Which means benchmarking for real use cases is a lot more meaningful than a synthetic test.
A system serving a bunch of files may be serving hundreds of GB/s - from RAM, with disk idling.
zpool iostat tells you about pool and individual disk I/O.
However, it gives no indication of the IO presented to users via ZFS cache hits (gstat and iostat are in the same boat).
A script out there called arcstat.py will tell you more about cache behaviour.
Deletes are slow
They just are.
They are in many journaled filesystems, at least when done via the POSIX interface (in part because the POSIX API has no bulk remove, so removing files one by one is the only correct behaviour(verify)).
Any journaled filesystem will have the underlying delete be a little slowish - it's a nontrivial thing to do and keep track of recoverability (and it's rarely as important to people as creating or altering files, which can matter in some design choices).
Some journaled filesystems are faster, and many will seem much faster because they e.g. remove the directory entry and queue the rest of the delete in the background, meaning the actual space will be freed soonish, but the shell command returns immediately.
(Note that you can imitate this in ZFS if you use snapshots.)
Notes on caches and buffers
ARC (L1ARC) is system RAM used to cache recently read files.
This makes it similar in function to the OS's page cache, but is ZFS-only.
- On Linux the ARC is not tied into the OS the way the page cache is, so is not quite as easily vacated (varies with OS-specific implementation; apparently most are better behaved than Linux's(verify))
So you can consider tweaking the ARC...
- when your major applications manage memory cache-awarely / cleverly (e.g. postgresql), it is preferable to tell ZFS to use ARC only for metadata, not file data.
- when it is important that all RAM allocations are served in the least time, you may want to set a limit to the ARC's size
- when you have other applications that assume they can use ~80% of RAM, it would always be fighting ZFS, and you may want to limit one or both.
- In some cases you may also know that ARC won't help your load much - this is relatively rare, but may e.g. be the case for very large, purely-sequential files.
The thing to set is zfs_arc_max, which if left at the default (0) tends towards half of physical RAM(verify). Innard-wise that gets translated to c_max (which is the thing to look for in /proc/spl/kstat/zfs/arcstats). If you're going to tune this, monitor your ARC for a while.
Setting can be done by
- writing a value to /sys/module/zfs/parameters/zfs_arc_max
- a byte value, so for 2GB it'd be e.g. 2147483648
- it won't shrink very fast. You could clear all caches (which will cause a lot of IO for a while, so can temporarily be a big performance dip on some types of server) with something like echo 3 > /proc/sys/vm/drop_caches
- apparently, a line in /etc/modprobe.d/zfs.conf (verify) (see the sketch after these notes)
Note that too low a value defeats this cache (and tends to mean poor IO), too high and you get cache contention
Note that setting this on a running system does not mean ARC will shrink immediately. You can force that by allocating a lot of memory (e.g. using stress).
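A sketch of both ways, for a 2GB limit:
# at runtime
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
# persistently, as a module option in /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=2147483648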
On errors and failure:
- RAM failure (as in, a module dying) is a system-has-crashed-or-not matter, not an ARC-specific one.
- RAM errors basically come down to whether you chose to use ECC RAM. If you didn't, that was a risk you already took.
L2ARC is an optional device-based read cache.
Consider using an L2ARC when you have a subset of your data that you read frequently, which is too large to keep in RAM but fits on e.g. an SSD.
Also consider that L2ARC has management overhead proportional to how large the device is (the data actually on there, actually, but *handwave*), on the order of a few hundred MB to a GB per TB. (verify) This, and the fact you'll probably have a Zipf-like distribution of benefit, means you should probably rarely have a very large L2ARC. (verify)
One good use case for L2ARC is a photo site's image servers: a read-heavy workload, many smallish reads, and a good number of requests will be for recently popular images, i.e. a small and fairly consistent subset that will probably largely fit within L2ARC.
This can raise throughput and IOPS. Even if it doesn't, the variance in both figures should be lower (and the effect on other disk use lower).
Using an SSD is preferred, as it has more effect on IOPS (and sometimes throughput).
You can use a platter-disk L2ARC and it will help a little, mostly because you're splitting work off to one more (and somewhat targeted) drive - but the cache's performance limit is then that of a single disk, with seeks almost as expensive as the pool's (it helps if you put L2ARC on a striped vdev, of course).
Basically, use a HDD L2ARC only when you can easily explain why it will help.
Note that you can use a ramdisk as L2ARC, either a regular ramdisk or a compressed one (like zram, which despite the name is entirely unrelated to ZFS). Yet it's a bit finicky, so not always worth it.
On errors and failure: An L2ARC cache is basically just as safe as your system is in general, as:
- regular checksumming applies, so no disk errors make it through.
- if the storage backing L2ARC fails, ZFS transparently stops using L2ARC.
The last means that if L2ARC is designed to be part of why your production servers are fast, you probably want to mirror it.
See also: tweaking the primarycache and secondarycache properties.
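For example, to keep only metadata cached for a filesystem whose application does its own data caching (a sketch; the filesystem name is made up):
zfs set primarycache=metadata poolname/pgdata     # ARC keeps only metadata for this filesystem
zfs set secondarycache=metadata poolname/pgdata   # same for L2ARC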
The ZIL (ZFS Intent Log) is basically a transaction log (similar to those in databases, if you're familiar).
Note that ZIL contents are duplicated in RAM and on disk, and ZFS uses the RAM copy. Normal operation rarely reads from the ZIL - it is there for correctness and recovery, and is read when importing. It is not a write buffer as such, just a non-volatile copy. Yet it is still important to IOPS, because it is flushed to disk regularly.
The ZIL is used for all ZFS metadata writes(verify).
The ZIL also applies to sync writes. (with some details: sync writes smaller than 64KB have their content go to the ZIL, larger writes go to the pool and are pointed to by the ZIL. That threshold is tweakable. This seems to be a sensible performance consideration that has little other impact?(verify))
Non-sync write data skip the ZIL, they are only in RAM (usually effectively a few-second cache(verify)).
It also applies to all filesystem syscalls that apply to ZFS. (Technically, this means that all writes are mentioned in the ZIL(verify), while the data itself is only there for (small) sync writes(verify))
The upshot is that recovery (after a crash / power loss) loses more async writes than sync writes. ...like any filesystem. Because that is a sensible bias that you want.
And due to ZFS's copy-on-write and transactional nature, you would just see an older version of the data. The ZIL is there in part so that "accepted into the ZIL" typically means "will be seen in the pool, be it now or a bit later".
So what of my everyday stuff is included?
- Most everyday workstation stuff is async
- you may notice the ZIL barely ever gets used
- everyday work also rarely saturates the disk, in which case the ZIL may not be a bottleneck at all.
- everyday data processing is also typically async
- ...though may occasionally saturate the disks.
- databases transaction logs writes are sync, and should be.
- NFS is sync by default (note this includes some virtualization uses, such as ESXi)
- iSCSI isn't always sync -- and you may want it to be (storage managers take note)
- anything else that calls fsync, or uses O_DSYNC
- some programs can be configured to use such sync writes
- You can apparently also force ZFS to treat all writes as async (but why?)
On place, size, and speed
If you have a structurally large amount of sync writes (database server, fileserver, various virtualization), then you probably want to put the ZIL on a separate device. Doing so is termed having a SLOG, "Separate intent Log".
While it is not a write buffer, correctness still relies on the data to be in-ZIL before an operation is acknowledged.
As such, a faster ZIL means faster acknowledgment, higher IOPS to sync clients, and more predictable write performance. Most of those more so if you have raidz that isn't fully loaded.
...in particular for platter-disk pools -- because by default the ZIL is stored within the pool it serves, meaning many sync writes are effectively written twice to a pool that is seek-sensitive by nature.
As such, using an SLOG at all tends to help, and there is already value to using a platter disk. Still, with SSD pricing these days it's usually a no-brainer.
For most workloads the ZIL need not be large, because the amount of not-yet-safe data is typically small. Since a transaction group is often ~5 seconds, the rule of thumb is that 5 times your typical speed of writing is enough. Even if all your data is sync and you average 400MB/s, that still means 2GB of your SSD is plenty (you probably still want a larger SSD to get more lifetime out of it).
When you care more about performance, and don't care that recovery doesn't always get you the latest data, then it's not insane to put your SLOG on tmpfs or similar RAMdisk. Though it often makes more sense to make sure only metadata and no data goes to ZIL.
On failure of device(s) backing the ZIL:
- Early versions (≤v19?) could not deal well with ZIL failure, current versions can.
- The ZIL content is duplicated in ARC (read: in RAM) until it is in the pool, so a disk that backs ZIL failing means no data loss (regardless of whether it's in-pool or SLOG)
There seem to still be roughly two failure cases:
- SLOG fails (meaning ZIL data is only in RAM), then power failure before ZFS adjusts to put ZIL data in the pool (basically the next txg)
- power failure, SLOG device fails on bootup
- (this is an argument against a HDD as SLOG)
Both can be mitigated with a mirror SLOG.
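Adding a SLOG looks roughly like this (a sketch; pool and device names are made up):
# a single SLOG device:
zpool add poolname log /dev/disk/by-id/ata-someSSD
# or a mirrored one:
zpool add poolname log mirror /dev/disk/by-id/ata-ssd1 /dev/disk/by-id/ata-ssd2
# unlike regular vdevs, a log device can later be removed:
zpool remove poolname ata-someSSD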
Tweaking recordsize and logbias
On write throttling
Notes on some features
Quota and reservations
On duplication to fight disk error (but not disk failure)
It can make sense to selectively ask for more copies of your precious raw data by asking ZFS to store multiple copies of data. (basically for the same self-healing properties)
Yes, mirror vdevs imply the same, but ditto blocks allow what amounts to mirroring at the ZFS-filesystem level rather than pool level (see the example after the list below).
- Ditto blocks refer to the fact that ZFS can store each block one, two, or three times (a low-level decision)
- all metadata always gets stored multiple times for general robustness (2 or 3 depending on exactly what)
- file data gets stored once, unless you request it to be stored 2 or 3 times
- ditto blocks are separate (i.e. apply on top of) redundancy you get from mirrors/raidz
- ditto blocks only help against disk error, not against disk failure
- roughly because the pool can still be considered failed, even if there are multiple (good) copies of the file you are interested in.
- Never plan on post-mortem recovery from failed pools, plan on pools not failing in the first place by using pool-level mirror/raidz
- transparent to applications.
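For example, asking for extra copies of one filesystem's data (a sketch; the filesystem name is made up, and it applies only to data written after setting it):
zfs set copies=2 poolname/precious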
On compression
- If you know essentially none of your data is compressible, you could leave it off
- ...though it won't compress data it doesn't think is compressible, so if you think it can ever help, enable and forget about it.
- The overhead is low enough to be good for all but the most latency-sensitive needs
- You may wish to use LZ4 when the default is LZJB
- biased to compress/decompress quickly rather than to compress well
- decompression uses some CPU, but typically adds very little latency
- you could separate compressible and incompressible data into different ZFS filesystems
- If your throughput bottleneck is in the IO hardware (relevant for storage clusters, and some small, few-disk setups), compression could actually help throughput (at negligible latency cost).
- Depend a bunch on the setup, though.
To test compression:
# Create a filesystem for the test
zfs create mypool/comptest
# enable compression on it
zfs set compression=lz4 mypool/comptest
# now copy on some data (compression only applies to new writes)

# you could look at the overall compression ratio
zfs get all | grep ratio

# On an existing filesystem, it's often more informative
# to see apparent size versus disk usage for your test files
du                   # reports how many bytes are actually used on platter
du --apparent-size   # reports the file's size that you see
On deduplication
ZFS provides block-level deduplication.
When enabled, it creates a lookup table, based on the same block hashes it uses internally, and when hashes/data match, only one block is actually stored.
There is one basic reason most people won't want it: it is memory hungry. For not-highly-deduppable data (most non-specific datasets), you can ballpark it at 1GB of RAM per 1TB of data (up to 5GB-ish for worse cases, e.g. mostly-small block sizes such as lots of small files), and more importantly, this table needs to sit in RAM if you want to accept new data at a vaguely reasonable speed (the alternative is akin to swapping this table in).
That GB-per-TB figure becomes a large factor in the cost/benefit calculation: Unless you know your users will have mostly duplicated data, the cost of drives you can save is probably lower than the cost of RAM you need to add.
Putting the dedup table on an SSD would be a decent inbetween in terms of speed (to avoid it coming from platter and the same pool) and price, but this seems to not be a feature yet.(verify)
There are some specific cases where it may still make sense, like a backup server that takes files as-is, because those are largely duplicates and it doesn't matter much if the occasional backup is a bit sluggish.
That said, you may want to consider that dedup clashes with backup somewhat, in that you don't have all the copies you might think you have.
Estimate whether it's possibly worth it
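One way to estimate without enabling it is zdb's dedup simulation, which prints a simulated dedup table histogram and ratio (it can take a while and a fair bit of RAM):
zdb -S poolname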
See how much it's saving
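The pool-wide ratio shows up in zpool output:
zpool list                      # has a DEDUP column
zpool get dedupratio poolname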
Finding which files have deduplicated blocks
Block Pointer Rewrite
Seems to be the name for "rewriting the data block as-is", which people want so that:
- pools with added disks get rebalanced
- settings like dedup and compression get applied to already-present files
Probably not going to happen anytime soon, and there are certain sounds along the lines of "it's getting so involved that it may not happen at all".
Which means that for things like applying compression/dedup to data written before you enabled it, or rebalancing, you're better off explicitly doing that move/rewrite yourself.
Issues and errors
vmap allocation for size 2101248 failed: use vmalloc=size to increase size
...or for other sizes. Probably the spl_kmem_cache process is going crazy as well.
That error means you're likely on 32-bit Linux, which by default has only ~100MB of kernel virtual memory.
- This leaves too little for ZFS to use, which means ZFS performance nosedives on most operations.
- For the 32-bit system I had this issue on:
# cat /proc/meminfo | grep Vmalloc
VmallocTotal:   122880 kB
VmallocUsed:    110352 kB
VmallocChunk:     1568 kB
You probably want to be running 64-bit OS anyway. There is typically no clean way to switch an existing system from 32-bit to 64-bit, so you'll have to reinstall.
If you don't want to switch (yet), you can reserve more kernel virtual memory with a kernel boot argument like vmalloc=
txg_sync taking a lot of CPU when under IO load
- If the pool is very nearly full, ZFS may be looking hard for free space to satisfy an operation
- happens much more easily when writes are mostly async
- memory pressure forced by programs using more than they should
- ZFS has had problems where it would think there is memory pressure when there wasn't really
- IO subsystem is bottlenecked(verify)
- I had an issue using a RAID-as-JBOD setup, where the controller's driver would stutter and pause IO when under high load
- ...which also triggered INFO: task blocked for more than 120 seconds.
- which got better when I used noop rather than default cfq scheduler
- bulk delete of many files doing its thing
- worse when doing a lot of sync writes when you don't have a SLOG (because it means more IO)
Filesystems mounted read-only
Additional devices are known to be part of this pool, though their exact configuration cannot be determined
...when trying to import.
You are likely to get more information when you specify a pool by its numeric id (listed when you run zpool import without arguments)
sudo zpool import 4109999925840410040
In my case it mentioned missing log device -- a ramdisk, which I hadn't made persistent. It was an experiment anyway, so I forced it to mount without this log device (import -m) and removed the missing log device from the configuration (zpool remove tank itsnumericidfromstatus)