Fsync notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


To programmers, the fsync() system call (and relatives like sync() and fdatasync()) basically means "please block this call until the data has been written to stable media", which for platter disks ideally means the platter itself.
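For example, a minimal C sketch of the pattern (the filename and record contents are made up for illustration):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
      const char *rec = "one journal record\n";
      /* "journal.log" is a made-up name for this example. */
      int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
      if (fd < 0) { perror("open"); return 1; }

      if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

      /* Block until the kernel reports the data is on stable media;
         this is exactly the promise that write-back caches can break. */
      if (fsync(fd) < 0) { perror("fsync"); return 1; }

      close(fd);
      return 0;
  }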


This is useful for things like database engines, as it means they can do checkpointing in a way that gives guarantees about up to what point things can be recovered (even if the power fails immediately after the write).

Something similar can matter for file and filesystem correctness.
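One classic example on the filesystem side is crash-safe file replacement: write a temporary file, fsync it, rename() it over the original, then fsync the directory so the rename itself is durable. A sketch, with illustrative names and error handling kept minimal:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Sketch: replace `path` with `data` such that after a crash we see
     either the old contents or the new, never a torn mix. Assumes both
     files live in the current directory, for the directory fsync. */
  int replace_file(const char *path, const char *tmp, const char *data) {
      int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0) return -1;
      if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {
          close(fd);
          return -1;
      }
      close(fd);

      if (rename(tmp, path) < 0) return -1;

      /* The directory entry changed too, so sync the directory. */
      int dirfd = open(".", O_RDONLY);
      if (dirfd < 0) return -1;
      int r = fsync(dirfd);
      close(dirfd);
      return r;
  }

  int main(void) {
      return replace_file("state.txt", "state.txt.tmp", "new contents\n") < 0;
  }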


Exceptions

One thing that can break fsync guarantees is write caching.


Drive bus controllers (e.g. SATA/SAS, or RAID devices) may do this when set to write-back rather than write-through mode. There is typically a default, which can be altered.

Hard drives themselves may also be doing this, and if they are, you may have little control over it.


When either of the above uses write-back caching, its "I've written it to the platter" actually means "I've received your data into my memory buffer and will get to it soon, promise".

In other words, they lie. For performance's sake.

which is nice in that the speedup is typically noticeable, and
which is bad because fsync loses its meaning.


Since

most writes are 'fire and forget' anyway,
if you lose power in the middle of one you lose it either way,
such writes are often still done in under a second, often much less (but no guarantees under load), and
it gives higher throughput,

...this tradeoff may often be very acceptable to you.


And sometimes not. In particular, the very point of databases doing regular checkpoints is often that we have a point we know we can recover to, and doing that requires something like fsync.

So if your disk subsystem lies, it probably won't recover to the last checkpoint, but it likely will recover to the one before it (assuming it's there).

Whether you find that acceptable depends a little on the industry you're in.

Alleviated risk

In databases and other relatively critical uses, you can still get away with a program not fsyncing all the time, or with not honouring fsync (as the mentioned drives/controllers do), under certain conditions:

  • UPS - if you trust things to always shut down cleanly before it's empty, you can get away with fsync=off in terms of power failure (but not crashes. never crashes). See the config sketch after this list.
  • SCSI or RAID controller with persistent write cache (battery-backed or non-volatile) - you can get away with write-back caching.
    In the case of batteries, the computer must restart within some amount of time, and you should regularly check that the battery isn't dead. In the case of non-volatile cache, there is no such time limit.
  • You can get away with both if your database can reliably recover to a recent good state and you don't care about that being "the most recent time you could possibly give me"...
    ...or you're talking about redundant/slave systems you can easily rebuild or, in the VM world, reinstantiate.
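As one concrete illustration: PostgreSQL exposes these tradeoffs directly as settings (the names below are PostgreSQL's; check the documentation for your version before relying on the details):

  # postgresql.conf (illustrative excerpt)
  fsync = off                  # stop issuing fsyncs entirely: fast, but a crash
                               # can corrupt data, so only for trusted power
                               # and rebuildable databases
  synchronous_commit = off     # milder: a crash may lose the last few commits,
                               # but the database stays consistent
  wal_sync_method = fdatasync  # which sync call to use; the best choice
                               # varies per OS, so measure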

sync limits

On platter disks you only get so many physical revolutions per second:

90 for 5400rpm,
120 for 7200rpm,
166 for 10000rpm,
250 for 15000rpm.

Which is just dividing by 60, minutes to seconds.


You can relate rpm to IOPS, and to maximum rate of fsync. Roughly, anyway.


The 5~10 ms of latency that all platter drives have comes partly from the rotational speed. For example, at 7200 rpm it takes 1/120 of a second (~8 ms) to spin around once. On average we are half of that away from where we want to be, plus we need to actuate the head within that time, plus some other things that aren't entirely instant. So chances are decent that we need much of that time before the write actually starts.


There are a lot of footnotes to that (some things working for us, some against), but the point is that this makes rotational speed a decent (but still rough!) indication of how fast you can do writes and/or fsyncs.

Read-heavy loads tend to be a bit nicer because more of them are sequential (which also means readahead helps us), but many write loads, and mixed loads (which are likely when hardware is not dedicated to a single task), have enough randomly-positioned operations that you might as well treat them as basically all random.

If you assume every operation hits a distinct place, and zero clever planning, then you tend to need ~1 revolution for each operation. This is why platter drive IOPS tends to lie in the 60~250 range.
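A back-of-envelope sketch of those numbers, under the same all-random, one-revolution-per-operation assumption:

  #include <stdio.h>

  int main(void) {
      /* Assumes one revolution per operation; ignores seek and transfer
         time, so this is a ceiling estimate, not a benchmark. */
      int rpms[] = {5400, 7200, 10000, 15000};
      for (int i = 0; i < 4; i++) {
          double rev_per_sec = rpms[i] / 60.0;             /* 7200 -> 120 */
          double avg_rot_ms  = 0.5 * 1000.0 / rev_per_sec; /* half a spin */
          printf("%5d rpm: %6.1f rev/s, ~%4.1f ms avg rotational latency, "
                 "~%3.0f IOPS\n", rpms[i], rev_per_sec, avg_rot_ms, rev_per_sec);
      }
      return 0;
  }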



variations on fsync

Databases are an interesting and useful example.

RDBMSes are great at consistency management, and at guaranteed recoverability.

Any serious database will get back to some previous good state regardless of how often you fsync; how long ago that last good state was depends on a few things: its settings, how often you fsync, and whether the drives (or controllers) lie.


So just how often you fsync is mostly a tradeoff between how many seconds of recent alterations you can accept losing (around a hard failure/reboot) and how efficiently you want to use the disk. That choice can vary with whether you keep cat pictures or financial records, and whether you have a UPS.

Also note that use of database transactions is neutral-to-positive for performance: larger transactions mean larger writes that tend to happen only at commit time, so fewer fsyncs.
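A sketch of that effect, using a hypothetical helper that syncs once per batch of records instead of once per record:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Hypothetical helper: append n copies of rec, fsyncing once per
     batch_size records. With a ~5 ms fsync, per-record syncing caps
     you at roughly 200 records/second no matter how small the records
     are; batching amortizes that cost. */
  static int append_records(int fd, const char *rec, int n, int batch_size) {
      for (int i = 0; i < n; i++) {
          if (write(fd, rec, strlen(rec)) < 0) return -1;
          if ((i + 1) % batch_size == 0 && fsync(fd) < 0) return -1;
      }
      return fsync(fd);  /* one last sync for any partial batch */
  }

  int main(void) {
      int fd = open("batch_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0) { perror("open"); return 1; }
      /* 1000 records, one fsync per 100: ~10 syncs instead of ~1000. */
      if (append_records(fd, "row\n", 1000, 100) < 0) perror("append");
      close(fd);
      return 0;
  }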


Also, for things like database checkpoint files we only care about the content, not whether the mtime/atime are accurate, or updated at all (relevant because metadata updates are often an extra physical write). This is one reason alternative sync calls exist.

For example, fdatasync() writes only the data, and does not update metadata that is not necessary for subsequent operations (such as mtime and atime). Since different kernels and OSes react differently (for example, Solaris's fdatasync() is implemented differently and apparently not actually faster), it is hard to say universally which is the best option; only comparing them on your specific system will tell you which is best.
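Since only measuring on your own system settles it, here is a minimal (and crude) timing sketch; the filename and round count are arbitrary:

  #include <fcntl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  /* Time n rounds of write-then-sync with the given sync call. */
  static double time_rounds(int fd, int n, int (*sync_fn)(int)) {
      char buf[512] = {0};
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < n; i++)
          if (write(fd, buf, sizeof buf) < 0 || sync_fn(fd) < 0) return -1.0;
      clock_gettime(CLOCK_MONOTONIC, &t1);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void) {
      int n = 200;  /* arbitrary round count */
      int fd = open("sync_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0) { perror("open"); return 1; }
      printf("fsync:     %.3f s for %d rounds\n", time_rounds(fd, n, fsync), n);
      printf("fdatasync: %.3f s for %d rounds\n", time_rounds(fd, n, fdatasync), n);
      close(fd);
      return 0;
  }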



On SSDs there isn't really such a thing as seek time, just operation overhead. From the classical storage view this is almost negligible, and read latency is low.


Write latency is higher than that - though for other reasons, and ones that are more interesting.

Average write latency is low, but occasionally write latency can be high, varying with the SSD implementation (and usually with the amount of recent writes).
Particularly in earlier SSDs this wasn't very predictable (because of optimization for specific cases).


On RAID, latency for most types is roughly on the order of that of the underlying disks, and beyond that it is nontrivial to estimate or guarantee. It depends on the RAID type, whether you're reading or writing, implementation details, and more.