Fsync notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

To programmers, the fsync() (and related, like fdatasync()) system calls basically mean "don't return until data has been written to stable media", for platter disks ideally meaning the platter itself.

This is useful for things like database engines, as it means they can do checkpointing in a way that gives guarantees about up to what point things can be recovered (even if the power fails immediately after the write).

Something similar can matter in file and filesystem correctness.
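As an illustration of the file-correctness case, here is a minimal Python sketch of a "write, fsync, then rename" pattern. The function name durable_write is made up for this example, and the directory fsync at the end is assumed to be available only on POSIX-like systems. It only helps if the layers below (controller, drive) actually honour fsync.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace the file at `path` with `data` (bytes), syncing before the rename.

    Hypothetical helper for illustration, not a complete implementation.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the rename
    # below stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # Python's userspace buffer -> kernel
            os.fsync(f.fileno())   # kernel -> device (if the device is honest)
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also sync the directory, so the rename itself is recorded.
    # Assumption: O_DIRECTORY exists on this platform (POSIX-like systems).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Usage would be something like durable_write("checkpoint.bin", b"state"); after it returns, a reader should see either the old content or the new content, not a half-written file.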


One thing that breaks fsync guarantees is write caching.

Drive bus controllers (e.g. SATA / SAS, or RAID devices) will do this when they are set to write-back rather than write-through. There is typically a default, which can be altered.

Hard drives could also do this, and you often have little control over that.

When either of the above uses write-back caching, its "I've written it to the platter" actually means "I received your data into my memory buffer and am about to start writing it".

In other words, they lie, for performance's sake,

  • which is nice in that the speedup is typically noticeable, and
  • which is bad because fsync loses its meaning

It has to be said that typically such a write is still done in under a second.

This still needs consideration, though. For example, databases will often write checkpoints with some regularity, meaning that if your disk subsystem lies, after a power failure the database may not be able to recover to the last checkpoint, but probably will to the one before it.

This difference may often be very acceptable to you.

And sometimes not. If saving every little bit of data is much more important than performance (e.g. you cannot easily recreate your data and you have one primary data store), then recoverability comes first, and you don't want to mess with fsync settings or writeback settings, or use drives that lie.

Alleviated risk

In databases and other relatively critical uses, you can still get away with not making a program fsync all the time, or not honouring fsync (as the mentioned drives/controllers do), under certain conditions:

  • UPS: if you trust things to always shut down cleanly before its battery is empty, you can get away with fsync=off in terms of power failure (but not crashes. Never crashes)
  • SCSI or RAID controller with persistent write cache (battery-backed or non-volatile): you can get away with writeback caching.
    In the case of batteries, the computer must restart within some amount of time. Also, do regular checks that the battery isn't dead. In the case of non-volatile cache, there is no such time limit.
  • You can get away with both if your database can reliably recover to a recent good state, and you don't care about that being "the most recent time you could possibly give me",
    ...or you're talking about redundant/slave systems you can easily rebuild or, in the VM world, reinstantiate

sync limits, other calls

On platter disks, you only get so many physical revolutions per second:

90 for 5400rpm,
120 for 7200rpm,
about 167 for 10000rpm,
250 for 15000rpm.

That's just dividing by 60, minutes to seconds, which you mostly do to relate this a little more to IOPS.

The above is where the 5~10ms of rotational latency comes from, which is a decent (but still rough!) indication of how fast you can do writes and/or fsyncs, because almost all real-world use contains enough randomly-positioned operations that you might as well treat it as basically all random.

That is because, without any clever planning, you tend to need a revolution to write the data, and possibly two if the write implies a metadata update that lives in a different place, which is true for basic filesystem use.
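The arithmetic above can be written out as a small Python sketch. The function names are made up for this example, and the "syncs per second" figure is only the rough ceiling the text describes (one or two revolutions per synced write), not a measurement.

```python
def rotation_stats(rpm):
    """Return (revolutions per second, milliseconds per revolution)."""
    revs_per_second = rpm / 60.0          # minutes to seconds
    ms_per_rev = 1000.0 / revs_per_second
    return revs_per_second, ms_per_rev


def rough_sync_ceiling(rpm, revs_per_sync):
    """Very rough upper bound on synced writes per second.

    Assumes each synced write costs `revs_per_sync` full revolutions
    (1 for data only, 2 if a metadata update lives elsewhere on disk).
    """
    return (rpm / 60.0) / revs_per_sync


for rpm in (5400, 7200, 10000, 15000):
    rps, ms = rotation_stats(rpm)
    print(f"{rpm:>5} rpm: {rps:6.1f} rev/s, {ms:5.2f} ms/rev, "
          f"~{rough_sync_ceiling(rpm, 1):.0f} syncs/s (data only), "
          f"~{rough_sync_ceiling(rpm, 2):.0f} syncs/s (data + metadata)")
```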

This is still rough, because

  • IO schedulers can make writes better. Usually the main idea is to keep the last few seconds of recent write operations in memory, and see if merging adjacent writes or re-ordering them helps. In certain cases this helps a lot (e.g. when many small adjacent writes can be merged and handled in a single revolution), in other cases not at all.
  • fsyncs are a wall across which no merging/reordering may happen, so frequent fsyncs reduce such optimizations to almost nil
    (in exchange, they give you the predictability of things being on disk sooner, which a few things require. See e.g. databases.)
  • some other things can make it worse.
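To get a feel for how much frequent fsyncs cost on a particular system, a rough Python timing sketch like the following can help. The function name and parameters are made up for this example. Note that the default temporary directory may be a RAM-backed filesystem, where fsync is nearly free, so pass a directory on the disk you actually care about. Results will also look very different on drives or controllers that lie.

```python
import os
import tempfile
import time


def time_writes(n_writes, sync_every, block=b"x" * 512, directory=None):
    """Time n_writes small appends, calling fsync after every `sync_every` writes.

    Returns elapsed seconds. Illustrative only, not a careful benchmark.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for i in range(1, n_writes + 1):
            os.write(fd, block)
            if i % sync_every == 0:
                os.fsync(fd)
        os.fsync(fd)  # final sync, so every variant ends with data flushed
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    n = 200
    print("fsync after every write:", time_writes(n, sync_every=1))
    print("fsync once at the end:  ", time_writes(n, sync_every=n))
```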

Still, the above is a major reason behind

  • why high-RPM disks were once very interesting for databases and other small-or-random-write-intensive purposes (now SSDs are)
  • why RPM is irrelevant for huge sequential file writing
  • why it is usually a bad idea to have more than one write-happy application on the same platter disk (if you care about speed or latency, anyway)

Databases are an interesting and useful example.

RDBMSes are great at consistency management, and at guaranteed recoverability.

Any serious database will get to some previous good state regardless of just how often you fsync, yet how long ago that last good state was depends on a few things - its settings, how often you fsync, and also whether the drives (or controllers) lie or not.

So just how often you fsync is mostly a tradeoff between how many seconds of recent alterations you can accept losing (around a hard failure/reboot), and how efficiently you want to use the disk. That choice can vary with whether you keep cat pictures or financial records, and whether you have a UPS.

Also note that use of database transactions is neutral-to-positive for performance, because larger transactions mean larger writes that tend to happen only at commit time, and so result in fewer fsyncs.
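A small Python sketch of that effect, using the sqlite3 module from the standard library. The function name and table are made up for this example, and it assumes SQLite's default durability settings, under which each commit is synced to disk. It inserts the same rows either as one transaction per row or as a single transaction.

```python
import sqlite3
import time


def insert_rows(db_path, n_rows, batched):
    """Insert n_rows integers; return (row count, elapsed seconds).

    batched=False: every INSERT is its own transaction (one commit each).
    batched=True:  all INSERTs share one transaction (one commit in total).
    """
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are only opened where we say BEGIN explicitly.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")
        start = time.perf_counter()
        if batched:
            conn.execute("BEGIN")
        for i in range(n_rows):
            conn.execute("INSERT INTO t (v) VALUES (?)", (i,))
        if batched:
            conn.execute("COMMIT")
        elapsed = time.perf_counter() - start
        count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        return count, elapsed
    finally:
        conn.close()
```

Calling it with batched=False and then batched=True on two fresh database files on a platter disk should show the difference, though how large it is depends on the disk subsystem and on whether it honours syncs.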

Also, for things like database checkpoint files we only care about content, not whether the mtime/atime are accurate, or updated at all (relevant because that's often a separate physical write). This is one reason alternative sync calls exist. For example, fdatasync() flushes the data plus only the metadata that is necessary for subsequent operations to find that data (such as the file size), and skips things like mtime and atime. Since different kernels and OSes will react differently (for example, Solaris' fdatasync is different and apparently not actually faster), it is hard to say universally which is the best option; only comparing them on your specific system will tell you.
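One hedged way to do that comparison from Python is sketched below. The function name is made up for this example. os.fdatasync is not available on every platform, so the sketch falls back to os.fsync there (in which case both timings measure the same call). The file is preallocated and then overwritten in place, on the assumption that this keeps the file size constant so fdatasync has a chance to skip the metadata write.

```python
import os
import tempfile
import time

# Assumption: where os.fdatasync is missing, fall back to os.fsync.
data_sync = getattr(os, "fdatasync", os.fsync)


def time_sync_call(sync, n_writes=50, block=b"x" * 4096, directory=None):
    """Overwrite a preallocated file block by block, calling sync(fd) each time.

    Returns elapsed seconds. Illustrative only, not a careful benchmark.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Preallocate, so the timed writes do not change the file size.
        os.write(fd, b"\0" * (len(block) * n_writes))
        os.fsync(fd)
        os.lseek(fd, 0, os.SEEK_SET)
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, block)
            sync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("fsync:    ", time_sync_call(os.fsync))
    print("fdatasync:", time_sync_call(data_sync))
```

As with any such timing, point the directory argument at the filesystem you care about, and expect the numbers to say little if the drive or controller uses write-back caching.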

On SSDs, seek time is negligible.

This means read latency is low, though for other reasons write latency is more interesting. Average write latency is low, but can occasionally be higher, and will vary depending on the SSD implementation (and usually the amount of recent writes).

On RAID, in most types latency is roughly on the order of that of the underlying disks, and beyond that is nontrivial to estimate or guarantee. It depends on the RAID type, whether you're reading or writing, implementation details, and more.