File polling, event notification, and asynchronous IO
Why would you want this?
Polling means regularly checking the state of something else, to see whether something has changed.
Polling doesn't scale very well, though:
- regularly checking a lot of things tends to become proportionally slower to evaluate
- often less because the check itself is slow or hard, and more because the means of communication is.
- checking more often, for lower latency, tends to become more costly to do (because you do it more often)
- multiple things polling the same things tends to become more costly (because the same check is repeated by each of them)
What's the alternative?
There are a few approaches but many address the issues with two main ingredients:
- the "polling by multiple things" can be alleviated by putting a single thing in charge
- which notifies you (ideally; there are some in-between forms)
- the "monitoring lots of things" and "polling more often for lower latency" problems can be alleviated by making that single thing hook into the mechanism of change itself, instead of interrogating the state and noticing that things have changed
- which then notifies you
Notes:
- We often call these event notification systems
- Desktop search and sync clients will probably stand to gain the most. Your battery as well.
- distributed systems will need their own flavours of these
Underlying mechanisms
noticing changes with manual stats
Availability: anything (with a POSIX filesystem interface, or similar)
Idea: frequently stat() things, and notice when e.g. their ctime/mtime/size has changed.
Pros:
- simple to implement
- works almost everywhere - doesn't rely on a kernel/OS-specific feature
- you can't e.g. miss a change by missing an event
- Possible optimization:
- a directory's mtime will change when a file/dir entry is added or removed
- so if a stat() of a known directory shows the same mtime, you don't need to re-scan that directory for added or removed entries
- however, directory mtime doesn't change when only file contents are changed (because that's not a directory-altering operation). This is mostly irrelevant for e.g. updatedb, but matters for a bunch of other applications so you would still stat each file
- watching a directory tree this way has some further rough edges
Cons:
- doing this for thousands of files, or noticing changes quickly, means a lot of syscall overhead
- more so for network filesystems
- relies on caching of filesystem metadata to be decently fast (and to not cause a lot of IO continuously)
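A minimal sketch of the stat-based approach, in Python: take a snapshot of (mtime, ctime, size) for everything under a directory, and compare it to the previous snapshot. The function names snapshot and diff are made up for this illustration, and it deliberately skips the directory-mtime optimization mentioned above.

```python
import os


def snapshot(root):
    """Map each path under root to (mtime_ns, ctime_ns, size), using lstat()."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # vanished between listing and stat; the next scan settles it
            state[path] = (st.st_mtime_ns, st.st_ctime_ns, st.st_size)
    return state


def diff(old, new):
    """Compare two snapshots; return (added, removed, changed) as sorted lists."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    changed = sorted(p for p in new if p in old and old[p] != new[p])
    return added, removed, changed
```

You would call snapshot() in a loop with a sleep in between, and feed the previous and current result to diff(). The sleep interval is the latency/cost tradeoff described under Cons.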
POSIX select and POSIX poll
Availability: most *nixes (though poll was not in the first POSIX versions, so isn't there on some now-ancient systems)
select is older and was more widely present, so you would historically use it, or at least still implement it as a fallback.
These days poll is present everywhere worth noting, so it is generally preferred over select.
Both can watch one or more file descriptors (sockets, files, etc).
Pros:
- 'wait for kernel to mention change' means it stays fast in most cases
- fairly ubiquitous
Cons:
- Doesn't report much detail, so for some purposes it's still fairly clunky and only part of the solution
- each select can't watch a lot of file descriptors
- select uses a fixed-size structure (size fixed at compile time, see FD_SETSIZE, commonly 1024)
- poll requires you to allocate the array of fd references
- so neither is the most efficient way to scale to watching many things
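As a rough illustration of what "wait for the kernel to mention change" looks like, here is a small Python sketch around the standard library's select.poll (which wraps poll(), and is not available on Windows). The helper name wait_readable is hypothetical.

```python
import os
import select


def wait_readable(fds, timeout_s):
    """Block until at least one of fds is readable, or timeout_s seconds pass.

    Returns the list of readable descriptors (empty on timeout).
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    # poll() takes milliseconds; each result is an (fd, event_mask) pair.
    events = poller.poll(int(timeout_s * 1000))
    return [fd for fd, mask in events if mask & select.POLLIN]
```

For example, with r, w = os.pipe(), wait_readable([r], 1.0) returns an empty list until something is written to w. Note that the full set of descriptors is handed to the kernel on every call, which is part of why this does not scale to very many descriptors.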
/dev/poll
Availability: Solaris-only
Alternative interface, functionally mostly like *nix poll.
Not implemented on linux because by that time, epoll was nicer anyway.
event ports
Availability: (Solaris, Illumos, versions?(verify))
Event ports are a generic event system.
File Events Notification is basically what you get when you use that for file descriptors (verify)
https://blogs.oracle.com/dap/entry/event_ports_and_performance
https://docs.oracle.com/cd/E36784_01/html/E36874/port-create-3c.html
epoll
Availability: Linux-only.
When compared to poll(), epoll scales better to watching a larger number of file descriptors,
because it uses a data structure to avoid iterating over all watched descriptors each time.
Notes
- present since Linux 2.5.44, so in practice in all 2.6 and later kernels
- the API is more complex and a little more flexible than poll()'s
- up to maybe a thousand descriptors, you won't see much difference (verify)
- changing what you're polling is roughly as expensive as with poll (because each change is a call into the kernel)
- so if you change often, it may make little difference (verify).
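To show the difference in shape from poll(): with epoll you register descriptors once, and later waits only ask for what became ready. A small sketch using Python's select.epoll (Linux-only); make_watcher and ready_fds are names invented for this example.

```python
import os
import select


def make_watcher(fds):
    """Create an epoll object and register each fd for readability, once."""
    ep = select.epoll()
    for fd in fds:
        ep.register(fd, select.EPOLLIN)
    return ep


def ready_fds(ep, timeout_s):
    """Wait up to timeout_s seconds; return the sorted fds that are readable."""
    # Unlike select.poll, epoll.poll() takes its timeout in seconds.
    return sorted(fd for fd, mask in ep.poll(timeout_s))
```

The registrations live in the kernel between calls, so the cost of each wait depends on how many descriptors are ready rather than how many are watched. This uses the default level-triggered mode, so a descriptor keeps being reported until its data is read.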
dnotify
Availability: Linux-only, since around 2.4
Can only watch directories, and uses an open fd for each watch.
Note that as file alterations in a directory also mark the containing directory as changed, this can be a fairly minimal way to report "has something changed in this directory / tree?".
Functionally, you may well want inotify now - it can do much the same.
inotify
Availability: Linux-only, kernel ≥2.6.13 and glibc ≥2.4 / 2.5, so since ~2005.
Practical notes:
- Watches inodes, so things with entries on a filesystem
- (...and not e.g. things you can only refer to via file descriptors, such as sockets)
- does not apply to (mounted) remote filesystems, mostly because remote change does not involve a local syscall
- and you probably wouldn't want to watch remote changes this way (arguably at all) -- it would scale badly
- Recursive watching
- is not provided by the kernel component, because it's a complex task for which there is no minimum-latency in-kernel way of doing it.
- libraries tend to provide it
- also e.g. automatically watch newly created directories.
- inotify's interface is itself a file descriptor, so can be watched with epoll (or poll or select)
For CLI tools, see e.g. the inotifywait, inotifywatch section.
Applicable limits
Some tuning you may need to do:
- fs.inotify.max_user_watches
- how many filesystem items can be watched (per user)
- may default to as low as 8192, but recentish installs set it vaguely proportionate to your amount of RAM, e.g. 60000
- for some purposes (e.g. sync clients when you keep lots of small files) you may need tens of thousands or more
- actual use costs ~1KB (of unswappable kernel memory) per watch, so when you actually start watching a million files that means ~1GB of physical RAM
- (when you actually watch such a high amount of files, also look at the CPU overhead)
- Note that multiple processes watching the same files/dirs is almost free in terms of watches, because you get references to existing watches
- fs.inotify.max_user_instances
- defaults to something like 128
- since there are rarely that many individual things watching your files, you don't often need to increase it.
- fs.inotify.max_queued_events
- defaults to something like 16384 (times ~272 bytes each is ~4MB)
- how many un-consumed events to keep (per inotify instance) before further events are dropped -- assuming the applications using them are paying active attention, you shouldn't need this to be very high
- Look at your clients before you increase this, because if the reason is that they are consuming slower than inotify is generating, a larger queue doesn't really help (for very long)
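To check what your system currently has, these sysctls are readable as files under /proc/sys/fs/inotify. A small Python sketch that reads them and redoes the rough memory arithmetic from above; the function names are made up, and the ~1KB-per-watch and ~272-bytes-per-event figures are just the estimates from these notes, not measured values.

```python
import os

LIMIT_NAMES = ("max_user_watches", "max_user_instances", "max_queued_events")


def read_inotify_limits(base="/proc/sys/fs/inotify"):
    """Return {name: int} for whichever limit files exist under base."""
    limits = {}
    for name in LIMIT_NAMES:
        try:
            with open(os.path.join(base, name)) as f:
                limits[name] = int(f.read().strip())
        except (OSError, ValueError):
            pass  # not Linux, or the file is missing/unreadable
    return limits


def estimated_bytes(count, bytes_each):
    """Back-of-envelope memory estimate: count items times bytes_each."""
    return count * bytes_each


# Rough figures from the notes above (assumptions, not measurements):
WATCH_BYTES = 1024  # ~1KB of kernel memory per watch
EVENT_BYTES = 272   # ~272 bytes per queued event
```

For example, estimated_bytes(1_000_000, WATCH_BYTES) is about 1GB, and estimated_bytes(16384, EVENT_BYTES) is a little over 4MB.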
If you want to know which processes are using inotify at all, you can get the number of instances (not watches) per named process with something like (thanks to [1]):
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c' -p '{}' | uniq -c | sort -nr
Seeing a watch count, or a list of all watches, is much harder.
The API doesn't seem to provide either, so this takes some work.
People suggest lsof for the count, although it seems this isn't universal (verify)
You can strace a process to look for inotify_add_watch, but this is only directly useful if you restart it.
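One more possible route, offered as an assumption rather than something these notes have verified: on kernels that expose it, /proc/PID/fdinfo/FD for an inotify descriptor contains one line starting with "inotify wd:" per watch, so counting those lines gives a watch count. A Python sketch of that idea (function names are hypothetical, and you will only see processes you have permission to inspect):

```python
import glob


def count_watches_in_fdinfo(text):
    """Count 'inotify wd:' lines in the text of one /proc/PID/fdinfo/FD file."""
    return sum(1 for line in text.splitlines() if line.startswith("inotify wd:"))


def watches_per_pid(proc="/proc"):
    """Return {pid: watch_count} for processes whose fdinfo we may read."""
    counts = {}
    for path in glob.glob(f"{proc}/[0-9]*/fdinfo/*"):
        try:
            with open(path) as f:
                n = count_watches_in_fdinfo(f.read())
        except OSError:
            continue  # process exited, or we lack permission
        if n:
            # path looks like <proc>/<pid>/fdinfo/<fd>
            pid = int(path[len(proc):].split("/")[1])
            counts[pid] = counts.get(pid, 0) + n
    return counts
```

Run as root to cover all processes. If the fdinfo files on your kernel do not contain such lines, this simply reports nothing.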
What things you can ask for
The mask is used both to
- ask the kernel which events to send us (filtering out the rest),
- tell us, in each event we receive, what exactly happened
A few are request-only:
IN_ONESHOT         only send event once
IN_ONLYDIR         only watch the path if it is a directory
IN_DONT_FOLLOW     don't follow a sym link
IN_EXCL_UNLINK     exclude events on unlinked objects
IN_MASK_ADD        add to the mask of an already existing watch
A few are response-only: (verify)
IN_ISDIR event occurred against dir
A few are special-case events:
IN_Q_OVERFLOW event queue overflowed. Lets your app e.g. log the error that it needs to read them faster (and maybe raise fs.inotify.max_queued_events)
Most of them relate to accesses/alterations you may want to look for, so can be used in both the request and the result (verify)
IN_CREATE          File/directory created in watched directory, e.g., open() with O_CREAT, mkdir(), link(), symlink(), bind() on a domain socket
IN_ATTRIB          metadata change, e.g. chmod(), chown(), utimensat(), setxattr(), link count (link()/unlink()) since Linux 2.6.25
IN_OPEN            file/directory opened
IN_ACCESS          e.g. read(2), execve(2)
IN_MODIFY          e.g. write(), truncate()
IN_CLOSE_WRITE     File opened for writing was close()d
IN_CLOSE_NOWRITE   File or directory not opened for writing was close()d
IN_DELETE          file/directory deleted from watched directory
IN_DELETE_SELF     watched file/dir was deleted (including 'moved to another filesystem')
IN_MOVE_SELF       watched file/directory was itself moved
IN_UNMOUNT         when filesystem backing watches is umounted
IN_IGNORED         watch is removed (e.g. directly after a delete, unmount, move(verify))
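To make the "mask in the result" part concrete: what you read() from an inotify descriptor is a sequence of records, each a fixed header (wd, mask, cookie, len) followed by len bytes of NUL-padded name. The Python sketch below decodes such a buffer and turns each mask into flag names. The bit values are those from Linux's <sys/inotify.h>; IN_MOVED_FROM and IN_MOVED_TO are not in the list above but are included so that masks decode completely. The names FLAGS, mask_names and parse_events are invented for this example.

```python
import struct

# Bit values as defined in Linux's <sys/inotify.h>.
FLAGS = {
    0x00000001: "IN_ACCESS",
    0x00000002: "IN_MODIFY",
    0x00000004: "IN_ATTRIB",
    0x00000008: "IN_CLOSE_WRITE",
    0x00000010: "IN_CLOSE_NOWRITE",
    0x00000020: "IN_OPEN",
    0x00000040: "IN_MOVED_FROM",
    0x00000080: "IN_MOVED_TO",
    0x00000100: "IN_CREATE",
    0x00000200: "IN_DELETE",
    0x00000400: "IN_DELETE_SELF",
    0x00000800: "IN_MOVE_SELF",
    0x00002000: "IN_UNMOUNT",
    0x00004000: "IN_Q_OVERFLOW",
    0x00008000: "IN_IGNORED",
    0x40000000: "IN_ISDIR",
}

# struct inotify_event header: int wd; uint32 mask, cookie, len (native order).
HEADER = struct.Struct("iIII")


def mask_names(mask):
    """List the names of the known flag bits set in mask, lowest bit first."""
    return [name for bit, name in FLAGS.items() if mask & bit]


def parse_events(buf):
    """Yield (wd, [flag names], cookie, name) for each event in a read() buffer."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        wd, mask, cookie, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        raw = buf[offset:offset + length]
        offset += length
        # The name is padded with NUL bytes; keep only what precedes the first.
        name = raw.split(b"\0", 1)[0].decode(errors="replace")
        yield wd, mask_names(mask), cookie, name
```

This only covers decoding. Actually obtaining such a buffer means calling inotify_init and inotify_add_watch (from C, via ctypes, or through a library) and reading from the resulting descriptor.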
fanotify
Availability: Linux
API that mostly just intercepts open, read, write, and close syscalls (note: originally not create, delete, move; Linux 5.1 and later added events for those).
Applies to all objects on a filesystem.
When your aim is to get reports of all changes in directory trees (rather than watching specific files for changes),
this can be more lightweight than inotify (verify).
As an API this also lets you insert code before the real access happens, or e.g. delay or deny access, which can be handy for use cases like virus scanning.
kqueue
Availability: BSD, OSX
FSEvents
Availability: OSX 10.7 (Lion)(verify)
File System Events, a.k.a. fsevents, fseventsd
Libraries - some of which talk to multiple underlying APIs
fam
Doesn't scale very well (TODO: explain which conditions)
Gamin
Separate implementation of a subset of FAM.
On Linux it uses inotify or dnotify, on BSD it uses kqueue/kevent.
Doesn't always scale very well (TODO: explain which conditions)
libevent
Released in 2000
Frontend to select, poll, epoll, kqueue and/or /dev/poll
libev
Released in 2007
"Full-featured high-performance event loop loosely modelled after libevent", and has emulation layer for libevent, and was apparently meant as a cleaner alternative.
Can use epoll, kqueue, select, poll (no windows)
libuv
Developed for node.js to abstract non-blocking IO, now used more widely.
Can use epoll, kqueue, IOCP, event ports.
See also:
http://docs.libuv.org/en/v1.x/
libae
Made for redis.
Can use epoll, kqueue, event ports, select
Boost asio
CLI tools
fswatch
Basically unifies:
- inotify (Linux),
- File System Events (OSX),
- kqueue (BSD, OSX),
- Solaris/Illumos File Events Notification,
- stat()-based scan-and-difference
- ReadDirectoryChangesW (Windows)
For example:
fswatch -0 /var/log/ | xargs -0 -n 1 ls -l
inotifywait, inotifywatch
inotifywait
- wait for a change of the type specified, then
- exits (default), or
- prints out (-m / --monitor)
When you want to parse the output, -q is useful to remove the "establishing watches" and such, -e to listen for specific events, and use something like read to separate out the filename and event name fields, for example:
inotifywait -m -q -e modify /var/log/syslog | \
while read -r filename event; do
  ls -l "${filename}"
done
inotifywatch - establishes watches with inotify, counts and summarizes events received (for some time or until a signal like Ctrl-C)
- useful for filesystem usage statistics
Example: (note that -r will watch only directories, not files, not because it couldn't but because it's easy to run out of watches)
inotifywatch -r /var/log
Then after a Ctrl-C some time later:
total  access  modify  close_write  close_nowrite  open  filename
2088   1553    249     1            142            143   /var/log/
1064   345     273     0            223            223   /var/log/apache2/
96     48      0       0            24             24    /var/log/dist-upgrade/
51     6       31      4            3              7     /var/log/munin/
48     24      0       0            12             12    /var/log/samba/cores/
39     15      0       0            12             12    /var/log/cups/
38     18      0       1            9              10    /var/log/samba/
36     12      0       0            12             12    /var/log/mysql/
34     10      14      0            5              5     /var/log/journal/16e361e7d1b2e06a9a71a09e544c11b1/
32     16      0       0            8              8     /var/log/journal/
24     12      0       0            6              6     /var/log/installer/
14     6       2       0            3              3     /var/log/atop/
watchman
Uses inotify on Linux, FSEvents/kqueue on OSX, and in theory supports Windows
https://facebook.github.io/watchman/
Windows
- Completion Ports (IOCP)
- very nice for watching specific files or connections, not so much for directories (verify)
- C++: FindFirstChangeNotification + ReadDirectoryChangesW
- under some well known but complex-to-describe conditions it will fail to send some notifications
- so it doesn't scale very well
- .NET's FileSystemWatcher
- seems to be built directly on the above, so equally flaky
- ...so you may want to use periodic scans to make sure you're up-to-date.
- USN Change Journals
- https://msdn.microsoft.com/en-us/library/aa363798(v=vs.85).aspx
- https://msdn.microsoft.com/en-us/library/aa363803(VS.85).aspx
- file system filter
- basically a kernel-level driver, so must be good quality.