File synchronization notes
Syncing wishes
unidirectional versus bidirectional
One of the major choices in synchronization is whether it is uni- or bidirectional.
Unidirectional synchronization means one place has the master copy,
and another location (or several) should be made to look exactly like it.
This usually means that every update is free to overwrite and delete, so you should never use the mirrored host to store your work or new alterations.
Unidirectional is a simple choice for backups, particularly the automated and off-site kind, because it has fewer exceptional situations that require intervention.
(Various implementations will want to avoid duplicate work, and e.g. assume that something with the same timestamp is up to date)
Bidirectional synchronization means two (or sometimes more) locations are all working copies, and changes in any one should become current in the other(s).
This can cause more sorts of conflicts, as you can imagine. Consider for example browser bookmark synchronization, and a case where it notices that an entry is present in one copy and absent in another. How do you know whether it was created in one, or deleted in the other?
Knowing that 'this is newer' is mostly necessary anyway, and goes a long way towards resolving such conflicts automatically (this often requires either knowing the relative times between clients, or uses revision tracking). It typically requires extra bookkeeping, which may have to be done at a point of arbitration (that has at least some memory of revisions).
There are other potential issues to address, such as multiple copies saying they have a variant of the data. In some cases you may want automatic resolving policies (such as "throw away all data but the one that says it's the newest"), in other cases you may want to deal with conflicts interactively (in the interest of not losing data, even if it's a tedious process).
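The bookmark example can be made concrete with a small sketch. It assumes the bookkeeping takes the form of a remembered set of entries from the last successful sync; that is only one possible design, and the names classify, base, left and right are made up for illustration.

```python
def classify(base, left, right):
    """Decide what happened to each entry that is present in only one copy.

    base  -- entries recorded at the last successful sync (the bookkeeping)
    left  -- entries currently in one working copy
    right -- entries currently in the other working copy
    """
    actions = {}
    for entry in left ^ right:  # present in exactly one of the two copies
        holder = "left" if entry in left else "right"
        other = "right" if holder == "left" else "left"
        if entry in base:
            # Both copies had it last time, so the copy lacking it deleted it.
            actions[entry] = f"deleted in {other}"
        else:
            # Neither copy had it last time, so the copy holding it created it.
            actions[entry] = f"created in {holder}"
    return actions


base = {"news", "mail"}
left = {"news", "mail", "wiki"}  # "wiki" was added here
right = {"news"}                 # "mail" was removed here
result = classify(base, left, right)
```

Without base, the two cases are indistinguishable: all you see is "present on one side only". Entries that were edited on both sides need more than this, e.g. timestamps or revision numbers.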
Bandwidth efficiency
Bandwidth (and read/write speed) is typically scarcer than storage size, so for many use cases there is quite a bit you can save by sending only differences.
Systems may choose to add a bunch of bookkeeping, to do this while also minimizing local IO (these systems are often based on content hashes or diff algorithms).
Syncing software
rsync notes
The most basic use of rsync is like any copy command, e.g. rsync [options] source dest
- ...where one side (or both) can be a remote host, making it a more generic copy tool, also useful for site mirrors, some kinds of backups, etc.
Rsync synchronizes file sets, and file contents.
- often between hosts
- but it has local uses.
- often used to "make the other side look more like this side" / "..contain at least what this side contains"
- but has more variations in how symmetric/asymmetric it is.
rsync is moderately clever about sending only differences, both:
- which files are selected for transfer
- by default: skips files with identical size and modification time on both sides
- skipping file contents which are the same (on network transfers)
- reads both sides, uses a rolling hash to detect (mainly) when things were appended to files
- (this can be clever about identical files and appended files, though often not about more complex changes)
So when sets of files are mostly the same as last time, relatively little work and relatively little transfer is necessary.
It also means you can continue a broken-off transfer (...mostly just by running the same rsync command again - with a few footnotes).
When you pay for transfers, lessening bytes transferred can help.
If your disk is faster than your network, you save time. (It comes from a time when network speed was much lower than disk speed. That difference is now smaller, but will probably always exist)
Many people will appreciate at least one of those.
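To give an idea of why a rolling hash makes "find the parts that are already there" cheap, here is a minimal sketch of an Adler-32-style weak checksum that can be slid along one byte at a time. It is in the spirit of what rsync does, not a copy of it: real rsync pairs the weak sum with a strong hash and exchanges block signatures between the two sides, and the function names and the modulus are choices made for this illustration.

```python
M = 1 << 16  # modulus for both running sums (an Adler-32-style choice)


def weak_checksum(block):
    """Compute the two running sums (a, b) over one block of bytes."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, old_byte, new_byte, block_len):
    """Slide the window one byte: drop old_byte at the front, add new_byte at the end."""
    a = (a - old_byte + new_byte) % M
    b = (b - block_len * old_byte + a) % M
    return a, b


def find_block(data, block):
    """Return the first offset in data where block occurs, or -1."""
    n = len(block)
    if n == 0 or len(data) < n:
        return -1
    target = weak_checksum(block)
    a, b = weak_checksum(data[:n])
    for offset in range(len(data) - n + 1):
        # Cheap test first; confirm with a real comparison because weak sums collide.
        if (a, b) == target and data[offset:offset + n] == block:
            return offset
        if offset + n < len(data):
            a, b = roll(a, b, data[offset], data[offset + n], n)
    return -1
```

The point is that moving the window costs a few additions rather than re-reading the whole block, so checking every byte offset of a large file stays affordable.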
Arguments
trailing slash
presence/absence of a trailing slash on the source is significant
# For example
rsync -a src/ dst # copies the entries _within_ src into the dst dir
rsync -a src dst # creates a directory called src inside the dst directory
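The trailing-slash rule above can be written down as a tiny function. This is only a model of where a file ends up, with landing_path being a hypothetical helper, not something rsync exposes.

```python
import posixpath


def landing_path(src_arg, dst, rel):
    """Where a file at `rel` (relative to the source directory) ends up under dst."""
    if src_arg.endswith("/"):
        # "src/": the entries within src go directly into dst
        return posixpath.join(dst, rel)
    # "src": a directory named after src is created inside dst
    return posixpath.join(dst, posixpath.basename(src_arg), rel)
```

For example, landing_path("src/", "dst", "a/b.txt") puts the file directly under dst, while landing_path("src", "dst", "a/b.txt") puts it under dst/src.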
basic arguments you'll use (-a and beyond)
The most important (/laziest) argument is probably -a because it sets a handful of options you generally want
Beyond that, rsync is a tool with so many arguments that you end up picking a few you like.
For example, I...
- typically use -ai (archive mode for recursion and metadata preservation, i for a summary of what and why (-i is shorter than -v for verbose))
- often use -n (--dry-run) to check what files it would select, without changing anything yet.
- because a few seconds extra up front checking for mistakes in paths are better than cleaning up a mess
- sometimes use --progress to check it's as fast as I want
- (mentions speed of individual transfers only, which is less useful on small files)
I more rarely use/remember:
- that you can filter what files get included or excluded
- that when updating large files, --partial and --inplace can matter
- that there is an option to use in the presence of hardlinks
- that adding --info=progress2 to --progress also gives an overall speed
- that --stats makes me feel better about saving a few bytes of transfer
-i (--itemize-changes)
Gives an eleven-character summary of changes for each file/dir
Examples:
- .d..t...... is a directory that changed time (will happen for directories that you work in)
- <f.st...... is a file we are transferring because it changed size and timestamp
- <f+++++++++ is a new file
- cd+++++++++ is a new directory
- cLc.t...... is a symlink whose target changed (a new symlink would show as cL+++++++++)
- .L..t...... is a symlink with a changed timestamp
- .f...p..... changed only permissions (not size/timestamp)
- .f....og... changed owner and group (not size/timestamp)
- *deleting appears when --delete decides to remove something.
In more detail:
- first column is the basic action
- . no content update (attributes may still be set)
- > the item is received from remote host
- < the item is sent to remote host
- h hardlinking (only when --hard-links specified)
- c creating dir / changing symlink
- * message follows
- the second is the entry type
- f file
- d directory
- L symlink
- D device
- S special file (e.g. named sockets and fifos)
The further columns are either present or shown as .
- c checksum is different (only when using --checksum)
- s size is different, will be transferring content
- t updating timestamp (or T when it will be set to the transfer time due to not using --times / -a)
- p updating permissions
- o updating owner
- g updating group
- u (reserved)
- a updating ACL
- x updating xattr
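The column layout described above is regular enough to decode mechanically. A minimal sketch, using only the meanings listed in these notes; decode_itemized is a made-up helper, and it does not handle message lines such as *deleting.

```python
ACTIONS = {
    ".": "no content update",
    ">": "received from remote host",
    "<": "sent to remote host",
    "h": "hardlink",
    "c": "local create/change",
    "*": "message follows",
}
TYPES = {"f": "file", "d": "directory", "L": "symlink", "D": "device", "S": "special"}
ATTRS = "cstpoguax"  # order of the remaining nine columns
ATTR_NAMES = {
    "c": "checksum", "s": "size", "t": "time", "p": "permissions",
    "o": "owner", "g": "group", "u": "reserved", "a": "ACL", "x": "xattr",
}


def decode_itemized(code):
    """Turn an eleven-character -i summary into (action, type, changed attributes)."""
    if len(code) != 11:
        raise ValueError("expected 11 characters")
    action = ACTIONS[code[0]]
    kind = TYPES[code[1]]
    flags = code[2:]
    if set(flags) == {"+"}:
        return action, kind, ["new item"]
    # Any column that is not a dot (or a space) reports a change in that slot.
    changed = [ATTR_NAMES[slot] for slot, ch in zip(ATTRS, flags) if ch not in ". "]
    return action, kind, changed
```

An uppercase T in the time column is reported simply as "time" here; telling it apart from t would need one extra check.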
--stats, and on reading those stats
Use of --stats gives you something like:
Number of files: 133548
Number of files transferred: 5896
Total file size: 177152625 bytes
Total transferred file size: 16682640 bytes
Literal data: 16682640 bytes
Matched data: 0 bytes
File list size: 3147986
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 154564
Total bytes received: 20201367

sent 154564 bytes  received 20201367 bytes  292891.09 bytes/sec
total size is 177152625  speedup is 8.70
What I consider most interesting:
- The source held ~133K files
- totaling 169MiB
- ~5900 files were selected for transfer (so ~127K files were skipped because they were considered identical)
- updating those ~5900 was done using ~19MiB of network transfer (of which ~16MiB was actual file data, so ~3MiB was rsync overhead)
- since we transferred ~19MiB instead of the 169MiB (that a naive transfer would have used), we are about 8.7 times as efficient, network-bandwidth-wise
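The arithmetic behind that reading, spelled out. The numbers are the ones from the --stats output above; the variable names are mine, and treating "everything that is not literal data" as overhead is my own rough bookkeeping rather than something rsync reports.

```python
MIB = 1024 * 1024

total_size = 177152625     # "Total file size"
literal_data = 16682640    # "Literal data"
bytes_sent = 154564        # "Total bytes sent"
bytes_received = 20201367  # "Total bytes received"

on_the_wire = bytes_sent + bytes_received
overhead = on_the_wire - literal_data  # file list, checksums, protocol chatter
speedup = total_size / on_the_wire     # what rsync reports as "speedup"

print(f"total size : {total_size / MIB:.1f} MiB")
print(f"transferred: {on_the_wire / MIB:.1f} MiB")
print(f"overhead   : {overhead / MIB:.1f} MiB")
print(f"speedup    : {speedup:.2f}")
```

So the reported speedup is simply total size divided by bytes that crossed the network in either direction.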
What to select and skip
There are two steps to an rsync transfer:
- the source figures out which files are selected for transfer (default: files that have different size or different modification time)
- check-and-transfer each selected file (using rolling hashes so that only changes are transferred)
by filesystem metadata
By default, rsync completely skips files that match in both size and mtime.
Equivalently, it selects everything that has changed size or last-modified time.
This is pretty good default behaviour, selecting only files that have been touched.
The only thing this has to assume is that timestamps are always updated on the destination - which is why you typically want -t, which you typically get via -a.
Alternatives for this part of the select/skip logic:
- -I or --ignore-times - "do not use the default behaviour of skipping based on matching mtime+size"
- basically meaning "consider everything for transfer" (verify)
- e.g. useful when you did a copy without timestamps, and want to now update timestamps - but only after verifying the files are correct
- -u - skip files that are newer (by mtime) on the receiver than on the sender.
- (generally, avoid combining this with --inplace: a broken-off transfer will usually mean the half-updated file has a new timestamp. If old and new versions had the same size, it would not be selected/continued the next run, even though it probably should have)
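A simplified model of this selection step, covering only the default quick check, -I and -u. The Meta class and is_selected are hypothetical names, and the order in which the options are tested is an assumption of this sketch, not taken from the rsync source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Meta:
    size: int
    mtime: float


def is_selected(src: Meta, dst: Optional[Meta], ignore_times=False, update=False):
    """Would this file be selected for transfer? (dst is None when it does not exist yet)"""
    if dst is None:
        return True                     # nothing there yet, always transfer
    if update and dst.mtime > src.mtime:
        return False                    # -u: receiver's copy is newer, leave it alone
    if ignore_times:
        return True                     # -I: do not skip on matching size+mtime
    # Default quick check: skip only when both size and mtime match.
    return src.size != dst.size or src.mtime != dst.mtime
```

This also shows the --inplace caveat: a half-updated file with the same size and a newer timestamp on the receiver is skipped under -u, even though its content is wrong.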
by content
-c do a pre-transfer full checksum on both ends (unrelated to the rolling checksum done during the transfer) to base selection on
- if the file would get selected for transfer for size/mtime reasons anyway, then this is pointless and only ever slower
You would probably mainly do this if
- you don't quite trust mtime selects,
- you just had a broken off -u --inplace and want to be sure that's fine
so you really want a more thorough check of what to select, and prefer to spend disk IO on both ends over just retransferring everything.
Also note that the -c --dry-run combination (e.g. -cavin) is basically a directory-recursive (remote) content diff.
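To make the "content diff" idea concrete, here is a local-only sketch that hashes every file in two trees and lists what a content-based check would select. It is an analogy, not what rsync runs: the hash (SHA-256) and the helper names are my choices, and the demo directories are throwaway ones.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's content in chunks, so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its content digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def content_diff(src, dst):
    """Relative paths that are missing from dst or differ in content."""
    src_digests = tree_digests(src)
    dst_digests = tree_digests(dst)
    return sorted(rel for rel, d in src_digests.items() if dst_digests.get(rel) != d)


def write(path, data):
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
    os.makedirs(src)
    os.makedirs(dst)
    write(os.path.join(src, "a.txt"), b"same")
    write(os.path.join(dst, "a.txt"), b"same")
    write(os.path.join(src, "b.txt"), b"new!")  # same size as the old version
    write(os.path.join(dst, "b.txt"), b"old!")
    write(os.path.join(src, "c.txt"), b"only on the source side")
    changed = content_diff(src, dst)
```

Note that b.txt has the same size on both sides, which is exactly the kind of difference a size+mtime check can miss and a content check cannot.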
selecting/skipping filenames, directories
- -r or --recursive - recurse into directories. Without this (note: it is implied by -a) rsync will skip directories it sees.
- When using filters, think about how it combines with -r (verify)
- --files-from=FILE - use a FILE for the exact list of files to transfer (can be - for stdin)
- -d or --dirs - transfer specified directories without recursing. (unless you specify a directory that either is exactly "." or ends with a slash). Some other options imply this. (verify)
- -x or --one-file-system - While recursing, do not cross into other filesystems (i.e. do not descend into mount points)
Path filters/rules are useful when you want to select a subset of the files in the source tree, and can describe it well using some patterns.
You specify rules like -f"RULE" or --filter="RULE", where a rule consists of a type and an argument.
The first matching filter applies, so order matters (short circuits).
There are nine different rule types, but you'll probably just use the two most common:
- -f "+ pattern" for 'want'
- -f "- pattern" for 'do not want'.
- Pattern notes
Notes:
- you can use shell-like globs, *, ?, and things like [135]
- * matches any string but stops at slashes (** does not)
- ? matches any single character except a slash
- [] is regexp-like (including some class names (verify))
- on slashes:
- starting with a / anchors the match to the root of the transfer (unless it is a per-directory rule)
- ending with a / will match a directory's name
- a slash elsewhere will match it against the full pathname (not just under the root either)
- Examples
"copy everything except .rec files"
-f"- *.rec"
"Copy only directory structure, no files" (remember both the short circuiting, and the slash logic)
-f"+ */" -f"- *"
"Copy the directory structure and any .jpg files in it":
-f"+ */" -f"+ *.jpg" -f"- *"
Note: without the + */ it wouldn't select the directory structure itself.
"Don't copy .svn directories": (note: -C / --cvs-exclude understands svn and various others)
-f"- .svn/"
Or perhaps:
-f"- **/.svn"
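The "first matching rule wins" logic in these examples can be modelled in a few lines. This is a deliberately simplified sketch: it only looks at a single path component, ignores anchoring and **, and uses Python's fnmatch rather than rsync's own matcher; is_wanted is a made-up name.

```python
from fnmatch import fnmatchcase


def is_wanted(rules, name, is_dir):
    """Apply (sign, pattern) rules in order to one path component; first match wins."""
    for sign, pattern in rules:
        dir_only = pattern.endswith("/")
        if dir_only and not is_dir:
            continue  # a pattern with a trailing slash only matches directories
        if fnmatchcase(name, pattern.rstrip("/")):
            return sign == "+"
    return True  # no rule matched: the entry is transferred as usual


# The "directory structure plus .jpg files" example from above
jpg_only = [("+", "*/"), ("+", "*.jpg"), ("-", "*")]
```

Dropping the first rule makes is_wanted(jpg_only[1:], "holiday", True) come out false, which is the reason the + */ rule is needed: without it the directories themselves are excluded, so rsync never gets to the .jpg files inside them.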
files-from
- -a, --archive - shorthand for -rtpgolD, approximately "recursive copy and preserve most metadata":
- -r - recursive (without -r you'll see "skipping directory" for each directory), but see the section on filters when combining the two
- -t - preserve modification times
- -p - try to preserve permissions (as in rwxrwxrwx details. Without this you get the receiving side's umask/ACL policy(verify))
- -g (or --group) - preserve group
- -o - preserve owner
- -l - try to preserve symlinks
- -D - preserve devices/specials (equivalent to combination of --devices and --specials)
This excludes a few other things you may care about:
- -A - update ACLs
- -X - update xattrs
- -H - try to hardlink as on the source (default is treating them as separate files. In part because finding hardlinks is expensive)
- -E - preserve execution (presumably for when you don't do -p(verify))
- -S - handle sparse files efficiently if possible
More tweaking of the synchronization logic, and actions around it:
- -R or --relative - controls how the target directory structure should be used based on the sender's command/directory(verify)
- --force - force replacements of non-empty directories (e.g. to be able to replace a directory with a same-named file)
- -O or --omit-dir-times - omit directories from -t/--times
- --delete
- receiver deletes files that do not exist on the sending side
- One typo can mean a lot of data is gone, so you probably want to dry-run this first
- Only deletes in directories that are being considered for transfer
- There are further options related to this, e.g. combination with excluded files
- Good for unidirectional archival copies that you want to keep exactly like their masters, because typically you're saying "I don't care about any extra data in the target, I want it to look exactly like the source"
- May be okay for incremental backups, in that it cleans things up, but think about the limitations
- Bad for centralized syncing (as in rarely the behaviour you'd want)
- one nice side effect is that it cleans up stale rsync tempfiles
- --remove-source-files removes files after transfer, i.e. makes it a move rather than a copy (but think about what you do with this one)
- -W / --whole-file - instead of the default diffing-update transfer cleverness, just copy the entire file
- can be faster when the network is not the bottleneck. (It also makes more sense when source and destination are both local filesystems - but whole-file copies are default behaviour for that case anyway)
- --append - assume data so far is right, continue after it. implies --inplace
- Useful for very large files
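What --delete amounts to is a set difference, restricted to the directories that take part in the transfer. A rough model with made-up names; real rsync additionally has to deal with excludes, deletion timing and so on.

```python
def plan_delete(sender_files, receiver_files, transferred_dirs):
    """Receiver-side paths that would be removed (relative, '/'-separated).

    transferred_dirs holds the directories being considered; "" stands for the root.
    """
    doomed = set()
    for path in receiver_files - sender_files:
        parent = path.rpartition("/")[0]
        # Only directories that are part of the transfer get cleaned up.
        if parent in transferred_dirs:
            doomed.add(path)
    return doomed


sender = {"a.txt", "sub/b.txt"}
receiver = {"a.txt", "old.txt", "sub/b.txt", "sub/stale.txt", "other/keep.txt"}
to_remove = plan_delete(sender, receiver, {"", "sub"})
```

It also shows why a typo in the source path is dangerous: an empty or wrong sender set makes everything on the receiver look like "extra".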
Network options
Related to the (network) transfer:
- --bwlimit=KBPS - limit the speed at which we send data, often to avoid saturating your outgoing bandwidth
- -z or --compress - compress file data
- --skip-compress=LIST - specify file suffixes that should not be compressed even when you use -z.
- The default is gz/zip/z/rpm/deb/iso/bz2/t[gb]z/7z/mp[34]/mov/avi/ogg/jpg/jpeg.
- You could specify a longer list as applicable to your content (e.g. gz/bz2/t[gb]z/tbz2/z/rpm/deb/zip/rar/7z/avi/mpg/mpeg/mp3/mp4/mkv/ogg/wmv/wma/mov/jpg/jpeg/png/raw/cr2/dng)
Errors
unexplained error / timeout in send/receive
rsync: connection unexpectedly closed (3324523 bytes received so far)
rsync error: unexplained error (code 255) at io.c(463)
...and if I use --timeout, it seems to structurally become:
rsync error: timeout in send/receive (code 30) at io.c(171)
Network related, apparently specifically ssh-related.
If bytes received/sent is 0, it may be an auth problem.
If the command worked before you moved it to cron, then it may be related to ssh keys (being in the wrong account - user crontabs may make life simpler).
If bytes received/sent is largeish, then it's likelier to be an intermittent connectivity problem.
protocol version mismatch — is your shell clean?
usually means your shell startup (~/.bashrc, ~/.bash_profile, ~/.ssh/rc, or whatnot) is printing things (on stdout).
For tools like this, which automate things on top of a shell login, that is a problem. Make the startup quiet (the interactive-shell distinction can matter, in that you can make that output conditional on the shell being interactive).
Sometimes that output may be specific to the user you're logging in as.
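A quick way to test for a clean shell is to run a remote command that should print nothing, and check that nothing indeed arrives on stdout. A minimal sketch; is_clean is a made-up helper, and the two demo commands are local stand-ins, since the real check would run something like ssh remotehost /bin/true (with remotehost being a placeholder).

```python
import subprocess
import sys


def is_clean(command):
    """Run a command that should print nothing; report whether stdout stayed empty."""
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    return result.stdout == b""


# Local stand-ins for a quiet and a chatty login. In practice the command
# would be e.g. ["ssh", "remotehost", "/bin/true"] (hostname is a placeholder).
quiet = [sys.executable, "-c", "pass"]
chatty = [sys.executable, "-c", "print('Welcome back!')"]
```

If the real check comes back non-empty, whatever was captured is the text that is confusing rsync's protocol handshake.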
failed to set times on “/a/path”: Operation not permitted
rsync: mkstemp failed: Operation not permitted (1)
error in rsync protocol data stream (code 12) at io.c
This seems to include various possible errors on the receiving side, from
- exceeding quota,
- to a filesystem issue,
- to ssh rejecting the key,
- to "the thing you handed into -e did not execute"
- to rsync not being installed on the receiving side (would have other indications?)(verify)
(verify)
There is a number after the io.c that is probably a line number, and may theoretically help, except that's probably changeable between versions.
- 231 currently seems to mean "network gave us EOF, which makes no sense"[1]
- and points to a more
mkstemp failed: Permission denied (13)
default_perms_for_dir: sys_acl_get_file(A_PATH, ACL_TYPE_DEFAULT): Permission denied, falling back on umask
Further notes
On spaces in paths
Paths with spaces easily lead to confusion.
Two solutions:
- escape the whitespaces in a way that the remote shell understands. For example, in:
rsync 'example.com:/path/with\ spaces/' /local/path/
- the quotes are so that our local shell sees that thing as one argument
- the backslashes will be sent to, and interpreted by, the remote end
- use --protect-args (-s), which sends argument values and path specs to the remote end and tells it not to interpret them.
- This is useful for arguments that contain spaces, wildcards, and such. Wildcards will still be expanded on the remote end -- but by rsync, not the shell.
non-*nix filesystems (e.g. NTFS, FAT)
- permissions will never match (because different systems), so if you use -p (and potentially similar from -o, -g), e.g. via -a, you'll see all files selected
- you can avoid that with:
--no-o --no-g --no-p
- (--no-SOMETHING cancels out a single implied option, e.g. one of the -rtpgolD that -a implies)
- timestamp resolution may cause files to be selected unduly. Consider making the comparison fuzzy:
- NTFS: small differences, so a --modify-window=1 is enough (and overkill)
- FAT: accurate to 2 seconds, so --modify-window=2 may be necessary, though 1 is often enough (verify)
- note that NTFS does support hardlinks. FAT does not
- if you want hardlinks, use -H (rsync in general does not preserve hardlinks by default, because finding them is expensive)
- Consider what you want to happen around symlinks. See e.g. the SYMBOLIC LINKS header in the man page
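Why a coarse timestamp causes needless selection, and how a comparison window fixes it, in a few lines. This is a model only: the rounding direction in fat_mtime is an assumption, and times_match merely mimics the idea behind --modify-window.

```python
def fat_mtime(mtime):
    """Store a timestamp at two-second resolution (rounding up is an assumption here)."""
    seconds = int(mtime)
    return seconds + (seconds % 2)


def times_match(src_mtime, dst_mtime, modify_window=0):
    """Treat timestamps as equal when they differ by at most modify_window seconds."""
    return abs(int(src_mtime) - int(dst_mtime)) <= modify_window


original = 1700000001          # odd second on the source filesystem
stored = fat_mtime(original)   # what the coarse filesystem remembers
```

With an exact comparison, original and stored differ, so the file would be selected on every run; with a window of 1 they count as the same.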
Listing which files are different
If you want to inspect which files have probably changed, this is mostly covered by the default test for "what will be selected for transfer".
In other words, an itemized dry run -n -i is a pretty informative answer to "what would I add/update from src to dest"
...to make all files present on src the same on dest. To also report what dest-side additions would be removed, do an -n --delete
Warning: Only remove the -n/--dry-run after a sanity check.
On ssh, rsh, and the daemon
A single colon implies connecting via remote-shell access. Historically this was rsh, nowadays it is typically ssh.
For example:
rsync remoteuser@remotehost:/remote/dir localcopy
A double colon, or a path that starts with rsync://, says you want to connect to an rsync daemon. Setting rsync up as a daemon may copy a little faster (no encryption overhead) but it is typically not so convenient: you have to figure out authentication and firewalling yourself.
For example:
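rsync localpath/ remoteuser@remotehost::remote/dir
rsync localpath/ rsync://remoteuser@remotehost/remote/dir
(in daemon mode the first path component is the name of a module configured on the daemon, not a filesystem path, so there is no leading slash; in the URL form a colon after the host would introduce a port number)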
Notes:
- In all cases, the rsync binary needs to be present on both sides.
- To specify a remote shell, use -e or --rsh (or by setting the RSYNC_RSH environment variable, and/or further trickery via RSYNC_CONNECT_PROG and such), such as:
rsync -e ssh remoteuser@remotehost:/remote/dir localcopy
# or (non-default port)
rsync -e 'ssh -p 2222' remoteuser@remotehost:/remote/dir localcopy
# or even (for speed on LANs, see SSH_-_loose_notes#On_bandwidth,_and_high-bandwidth_networks)
rsync -e 'ssh -c arcfour -o Compression=no -x' remoteuser@remotehost:/remote/dir localcopy
On continuing files (partial, inplace, temporary files and more)
tl;dr:
- default is safe, and slow
- add --partial --partial-dir=something to continue from a gracefully broken off copy
- use --inplace when you care about being able to continue the transfer of individual files (because they are large)
- least IO, but be aware other programs looking at the file will see intermediate nonsense state
When network speed is noticeably slower than disk speed, check-and-continue is nice (for speed and network use) because verifying file content happens at disk speed.
Roughly:
- default behaviour is to write a new file (temporary file), to swap in. This is the default because it is the safest behaviour
- When rsync continues from a partial, it copies content from the partial file to a new temporary file.
- can be a lot of IO work, which is easily avoidable when you don't need the replacement behaviour.
- Gracefully breaking off a transfer means temporary files are thrown away (!).
- non-gracefully means you've got a hidden file never looked at again, and taking space
- (continue only happens from the file with its final name, not from a leftover temporary file)
- --partial
- On graceful breakage, an unfinished temporary file is moved to the target path.
- if it wasn't there, that allows for a continue (inplace or not)
- if it was there, you've effectively truncated it
- (potentially throwing away a lot of data that may have been identical)
- --partial with --partial-dir
- on graceful breakage, copies to a file in a specific subdir
- which is considered for continue if the next run specifies the same --partial-dir
- --inplace
- writes/updates the target file directly (instead of making a temporary file)
- compared to the default, this is faster for large files (the least IO)
- (as the default is to create a complete copy and then remove the old one)
- but don't use the files while transferring
Terminology:
- temporary file: a dotfile that rsync writes
- to eventually replace the real file it is replacing
- This is safe behaviour, hence the default.
- (if the target existed and was mostly correct, this is mostly a disk-speed copy operation)
- Partial file is the final name, the path that a finished file will be renamed/moved to - which may exist already.
- called this because rsync will only consider check-and-continue transfer when it has a partial file to start from.
- it will not continue from a temporary file that wasn't cleaned up. (...but you could choose to rename it, to save some time)
- (There is one case where the partial file is distinct from the file's final destination: if you use --partial and --partial-dir, the partial file will be a separate file written on graceful break-off)
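The difference between "write a temporary file and swap it in" and "write in place" shows up when a transfer breaks off. The sketch below imitates both in plain Python; it is not how rsync is implemented, and the helper names and the simulated broken transfer are made up for the demonstration.

```python
import os
import tempfile


def write_via_tempfile(target, chunks):
    """Write to a hidden temporary file next to target, then swap it into place."""
    directory, name = os.path.split(os.path.abspath(target))
    fd, tmp_path = tempfile.mkstemp(prefix="." + name + ".", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_path, target)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_path)           # broken off: throw the temporary file away
        raise


def write_inplace(target, chunks):
    """Overwrite target directly; a break-off leaves a half-updated file behind."""
    with open(target, "r+b" if os.path.exists(target) else "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.truncate()


def broken_transfer():
    yield b"NEW-"
    raise ConnectionError("link dropped")


def read(path):
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    safe = os.path.join(tmp, "safe.bin")
    direct = os.path.join(tmp, "direct.bin")
    for path in (safe, direct):
        with open(path, "wb") as f:
            f.write(b"old content")
    for writer, path in ((write_via_tempfile, safe), (write_inplace, direct)):
        try:
            writer(path, broken_transfer())
        except ConnectionError:
            pass
    safe_after = read(safe)
    direct_after = read(direct)
    leftovers = sorted(os.listdir(tmp))
```

After the simulated break-off, the file written via a temporary file is untouched, while the in-place one is a mix of new and old bytes. That mix is exactly what makes in-place writing cheap to continue and unsafe to read in the meantime.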
Another important distinction is how rsync can be broken off.
It can be broken off gracefully/gently, which includes:
- network timeout by rsync (if you have set a timeout -- by default rsync waits indefinitely)
- killing the receiving process/daemon (verify)
- Ctrl-C when we send data elsewhere (verify)
It can also be killed non-gently, which includes
- Ctrl-C if copying from a remote host? (verify)
- On timeout
While it is useful to specify a timeout rather than wait indefinitely, you often still want a large value to be safe.
The test seems to be "time without transferred content", not "time spent idling", so hard pre-transfer work (e.g. for large matching files) will count towards the timeout too.
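Notes
If you use --delete and --timeout, you may want --delete-during behaviour rather than --delete-before (which older rsync versions default to; rsync 3.0.0 and later pick --delete-during on their own when both ends are new enough)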
On speed
- Encryption overhead
See the mention of arcfour above, and why.
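tl;dr: you can often gain some speed using -e 'ssh -c arcfour' (on older setups, that is: recent OpenSSH releases no longer ship arcfour, so there you would have to pick another cheap cipher)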
- Parallel rsync
If speeds are limited due to an older rsync with small buffers (and a higher-latency remote host), then doing multiple transfers will get you more speed.
I do this when I can easily make filter-based rules to split things up.
In theory there are some xargs tricks that can make that easier. (TODO: play with that)
Small files first
You cannot sort by size (unless you do it externally yourself and hand rsync the sorted list), but you can e.g. do a few incremental steps like:
--max-size=50K
--max-size=200K
--max-size=1M
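To see what such stepwise runs do, the sketch below splits a set of files into the successive passes that would newly pick them up. It only models the size thresholds, assuming K and M mean multiples of 1024; passes_by_size and the example sizes are made up.

```python
KIB = 1024


def passes_by_size(sizes, limits):
    """Split {path: size} into passes; each pass takes not-yet-covered files that fit its limit."""
    remaining = dict(sizes)
    passes = []
    for limit in limits:
        batch = sorted(p for p, s in remaining.items() if s <= limit)
        for p in batch:
            del remaining[p]
        passes.append(batch)
    passes.append(sorted(remaining))  # final run without a size limit
    return passes


sizes = {"a": 10 * KIB, "b": 100 * KIB, "c": 900 * KIB, "d": 5 * KIB * KIB}
plan = passes_by_size(sizes, [50 * KIB, 200 * KIB, 1024 * KIB])
```

Each later run still looks at the smaller files again, but since they now match on the receiving side they are skipped, so the extra cost is mostly the file list.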
Moving instead of copying
Non-unix filesystems
Possible issues:
- user cannot write timestamps (e.g. non-root, also depending on how it was mounted)
- filesystems with lower-resolution timestamps means files are always selected (e.g. FAT)
- reduce comparison accuracy. Something like --modify-window=2 should cover most cases
- -c (--checksum) is safer but takes longer.
- If you know these files don't change, then --size-only may be simpler.
- lots of errors trying to preserve permissions, owners, links, special files, timestamps (e.g. FAT has only one timestamp)
- avoid -a if you want to avoid these. Exactly which options to keep varies per filesystem
- If you are copying e.g. from cygwin-on-windows to *nix, then trying to preserve permissions is probably messy at best.
- You might like to,
- instead of just --perms meaning "try to reproduce",
- do something like --perms --chmod=Du=rwx,Dgo=rx,Fu=rw,Fog=r which tells rsync to override what the sending side would have set (which is also why it has no effect without --perms) with the specified permission bits (D and F restrict a rule to directories and files).
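What that --chmod string works out to can be checked with a small interpreter for just this subset of the syntax (an optional D or F prefix, who letters, =, and rwx). It is a sketch for reading such strings, not a reimplementation of rsync's parser; apply_chmod is a made-up name.

```python
WHO_SHIFT = {"u": 6, "g": 3, "o": 0}
PERM_BITS = {"r": 4, "w": 2, "x": 1}


def apply_chmod(spec, mode, is_dir):
    """Apply rules like 'Du=rwx,Fog=r' to a permission mode; only '=' is supported."""
    for rule in spec.split(","):
        target = None
        if rule[0] in "DF":
            target, rule = rule[0], rule[1:]
        if (target == "D" and not is_dir) or (target == "F" and is_dir):
            continue  # rule is for the other kind of entry
        who, perms = rule.split("=")
        bits = sum(PERM_BITS[p] for p in perms)
        for w in who:
            shift = WHO_SHIFT[w]
            mode = (mode & ~(0o7 << shift)) | (bits << shift)
    return mode


SPEC = "Du=rwx,Dgo=rx,Fu=rw,Fog=r"
```

So whatever the sending side reports, directories come out as 755 and files as 644.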
Related and similar software
Other utilities also use the rsync algorithm (often via librsync), as do some more complex incremental sync / backup systems. These include
- rdiff
- rsnapshot - useful for backups, where it can save a lot of space by using hardlinks when files are identical.
- Unison - rsync-like
- syncrify - reimplementation with some rsync logic in it
rsync on windows
Note:
- pretty much all real options use cygwin as a basis
- which implies windows paths need to be translated to unix-style paths, with cygwin's mapping of drives, e.g.
C:\path\to\file
becomes
/cygdrive/c/path/to/file
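If you generate such commands from a script, the path translation is easy to automate. A minimal sketch, assuming cygwin's default /cygdrive prefix and plain drive-letter paths; to_cygdrive is a made-up helper.

```python
from pathlib import PureWindowsPath


def to_cygdrive(win_path):
    """Translate an absolute drive-letter path to cygwin's /cygdrive form."""
    p = PureWindowsPath(win_path)
    if len(p.drive) != 2 or p.drive[1] != ":":
        raise ValueError("expected a path starting with a drive letter")
    # parts[0] is the drive and root (e.g. 'C:\\'); the rest are the path components
    return "/".join(["/cygdrive", p.drive[0].lower(), *p.parts[1:]])
```

PureWindowsPath does the parsing, so this works on any platform, not just on Windows.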
- cygwin itself
If you already have a cygwin install, you can install rsync in it.
If you don't, cygwin is probably overkill (and a bit of a learning curve), and you can use one of the following:
- cwRsync
This packages rsync with just enough of cygwin to work. (the free cwRsync version is just that, the paid version adds a GUI)
Personally, I ended up using this one, with a batch file containing something like:
rsync -r -t -v --progress -e "C:\cwRsync\ssh.exe -p 22" user@host:/data/Docs/ /cygdrive/d/Docs
pause
- Grsync
minimal-rsync-via-cygwin plus a GTK interface (win+lin+bsd).
The cygwin stuff is more transparent, with the path rewrite done for you.
When it works it's nice, but for some it freezes frequently, and its interaction with the underlying rsync.exe seems a bit iffy.
Additional options you may want:
-e "ssh -p 22"
- DeltaCopy
Wraps rsync and cygwin, adds its own GUI (e.g. includes a scheduler).
Client-server style, uses the rsync protocol/port.
See also