File synchronization notes


Stuff vaguely related to storing, hosting, and transferring files and media:


Syncing wishes

unidirectional versus bidirectional

One of the major choices in synchronization is uni- or bi-directional:


unidirectional synchronization means one place that has a master copy, and another location (or multiple) that should be made to look exactly like it.

This usually means that every update is free to overwrite and delete, so you should never use the mirrored host to store your work or new alterations.

Unidirectional is a simple choice for backups, particularly the automated and off-site kind, because it has fewer exceptional situations that require intervention.

(Various implementations will want to avoid duplicate work, and e.g. assume that something with the same timestamp is up to date)


Bidirectional synchronization means two (or sometime more) locations are all working copies, and changes in any one should become current in the other(s).

Which can cause more sorts of conflicts, as you can imagine. Consider for example browser bookmark synchronization, and a case where it notices that an entry is present in one and absent in another copy. How do you know whether it was created in one, or deleted in the other?

'This is newer' logic is mostly necessary anyway, and goes a long way toward resolving such conflicts automatically (it requires either knowing the relative times between clients, or some form of revision tracking). This typically means extra bookkeeping, which may have to be done at a point of arbitration that has at least some memory of revisions.

There are other potential issues to address, such as multiple copies each claiming to have a variant of the data. In some cases you may want automatic resolution policies (such as "throw away all data but the one that says it's the newest"); in other cases you may want to deal with conflicts interactively (in the interest of not losing data, even if it's a tedious process).

Bandwidth efficiency

Bandwidth (and read/write speed) is typically scarcer than storage size, so for many use cases there is quite a bit you can save by sending only differences.


Systems may choose to add a bunch of bookkeeping, to do this while also minimizing local IO (these systems are often based on content hashes or diff algorithms).


Syncing software

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)



rsync notes

Rsync synchronizes sets of files, and file contents.

Usually between hosts, but it has local uses.
It is mostly used to "make the other side look more like this side" / "...contain at least what this side contains", but there are variations in how symmetric or asymmetric it is.


The most basic use of rsync is like any copy command, e.g. rsync [options] source dest

...though one or both sides can be a remote host, making it a more generic copy tool between local disks, or remote hosts, and can work well to update backups, create site mirrors, etc.


rsync is also clever about sending only differences, both:

  • which files are selected for transfer
by default: skips files with identical size and modification time on both sides
  • skipping file contents which are the same (on network transfers)
reads both sides, uses a rolling hash to detect (mainly) when things were appended to files
(this can be clever about identical files and appended files, though often not about more complex changes)

So when sets of files are mostly the same as last time, relatively little work and relatively little transfer is necessary.

It also means you can continue a broken-off transfer (...mostly just by running the same rsync command again - with a few footnotes).


When you pay for transfers, lessening bytes transferred can help.

If your disk is faster than your network, you save time. (It comes from a time when network speed was much lower than disk speed. That difference is now smaller, but will probably always exist)

Many people will appreciate at least one of those.






Arguments

trailing slash

presence/absence of a trailing slash on the source is significant

# For example
rsync -a src/ dst   # copies the entries _within_ src into the dst dir
rsync -a src  dst   # creates a directory called src inside the dst directory


basic arguments you'll use (-a and beyond)

The most important (/laziest) argument is probably -a, because it sets a handful of options you generally want


Beyond that, rsync is a tool with so many arguments that you end up picking a few you like.

For example, I...

  • typically use -ai (-a for recursion and metadata preservation, -i for a summary of what changed and why; -i is also terser than -v)
  • often use -n (--dry-run) to check which files it would select, without changing anything yet
because a few seconds spent up front checking for path mistakes beats cleaning up a mess
  • sometimes use --progress to check it's as fast as I want
(this reports the speed of individual transfers only, which is less useful on many small files)


I more rarely use/remember:

  • that you can filter what files get included or excluded
  • that when updating large files, --partial and --inplace can matter
  • that there is an option to use in the presence of hardlinks
  • that adding --info=progress2 to --progress also gives an overall speed
  • that --stats makes me feel better about saving a few bytes of transfer
-i (--itemize-changes)

Gives an eleven-character summary of changes for each file/dir

Examples:

  • <f.st...... is a file we are transferring because its size and timestamp changed
  • <f+++++++++ is a new file
  • cd+++++++++ is a new directory
  • cLc.t...... is a new symlink
  • .L..t...... is a symlink with a changed timestamp
  • .d..t...... is a directory that changed time (will happen for directories that you work in)
  • .f...p..... changed only permissions (not size/timestamp)
  • .f....og... changed owner and group (not size/timestamp)
  • *deleting   when --delete decides to do so.


Details:

  • first column is the basic action
. no content update (attributes may still be set)
> the item is received from the remote host
< the item is sent to the remote host
h hardlinking (only when --hard-links specified)
c creating dir / changing symlink
* message follows
  • the second is the entry type
f file
d directory
L symlink
D device
S special file (e.g. named sockets and fifos)


The further columns are either present or shown as .

  • c checksum is different (only when using --checksum)
  • s size is different, will be transferring content
  • t updating timestamp (or T when it will be set to the transfer time due to not using --times / -a)
  • p updating permissions
  • o updating owner
  • g updating group
  • u (reserved)
  • a updating ACL
  • x updating xattr
--stats, and on reading those stats

Use of --stats gives you something like:

Number of files: 133548
Number of files transferred: 5896
Total file size: 177152625 bytes
Total transferred file size: 16682640 bytes
Literal data: 16682640 bytes
Matched data: 0 bytes
File list size: 3147986
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 154564
Total bytes received: 20201367

sent 154564 bytes  received 20201367 bytes  292891.09 bytes/sec
total size is 177152625  speedup is 8.70


What I consider most interesting here:

  • The source held ~133K files
  • totaling 169MiB
  • ~5900 files were selected for transfer (so ~127K files were skipped because they were considered identical)
  • updating those ~5900 was done using ~19MiB of network transfer (of which ~16MiB was actual file data, so ~3MiB was rsync overhead)
  • since we transferred ~19MiB instead of the 169MiB (that a naive transfer would have used), we are about 8.7 times as efficient, network-bandwidth-wise


What to select and skip

There are two steps to an rsync transfer:

  • the source figures out which files are selected for transfer (default: files that have different size or different modification time)
  • check-and-transfer each selected file (using rolling hashes so that only changes are transferred)


by filesystem metadata

By default, rsync completely skips files that match in both size and mtime.

Equivalently, it selects everything that has changed size or last-modified time.


This is pretty good default behaviour, selecting only files that have been touched.

The only assumption this makes is that timestamps are faithfully copied to the destination - which is why you typically want -t, which you typically get via -a.


Alternatives for this part of the select/skip logic:

  • -I or --ignore-times - "do not use the default behaviour of skipping based on matching mtime+size"
basically meaning "consider everything for transfer" (verify)
e.g. useful when you did a copy without timestamps, and want to now update timestamps - but only after verifying the files are correct
  • -u - skip files that are newer (by mtime) on sender than on the receiver.
(generally, avoid combining this with --inplace: a broken-off transfer will usually mean the half-updated file has a new timestamp. If old and new versions had the same size, it would not be selected/continued the next run, even though it probably should have)
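A small local demonstration of -u leaving receiver-side edits alone (scratch paths, invented for the example):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/dst"
echo "from sender"   > "$tmp/src/f.txt"
echo "receiver edit" > "$tmp/dst/f.txt"
# Backdate the sender's copy so the receiver's is strictly newer
touch -t 200001010000 "$tmp/src/f.txt"

# -u: files newer on the receiving side are skipped
rsync -au "$tmp/src/" "$tmp/dst/"
cat "$tmp/dst/f.txt"    # still "receiver edit"
```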




by content

-c do a pre-transfer full checksum on both ends (unrelated to the rolling checksum done during the transfer) to base selection on

if the file would get selected for transfer for size/mtime reasons anyway, then this is pointless and only ever slower


You would probably mainly do this if

  • you don't quite trust mtime selects,
  • you just had a broken-off -u --inplace transfer and want to be sure it's fine

and so really want a more thorough check of what to select, preferring to spend disk IO on both ends over just retransferring everything.


Also note that the -c --dry-run combination (e.g. -cavin) is basically a directory-recursive (remote) content diff.
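For example, a sketch of that content diff on two local directories (contents invented; metadata is equalized so only -c can tell the copies apart):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b"
echo "same" > "$tmp/a/same.txt"; cp "$tmp/a/same.txt" "$tmp/b/"
echo "one"  > "$tmp/a/diff.txt"; echo "two" > "$tmp/b/diff.txt"
# Give everything the same size and mtime
touch -t 202001010000 "$tmp/a"/* "$tmp/b"/*

# Checksum-based itemized dry run: lists only content differences
rsync -cain "$tmp/a/" "$tmp/b/"
```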

selecting/skipping filenames, directories
  • -r or --recursive - recurse into directories. Without this (note: implied in -a) rsync will skip directories it sees.
    • When using filters, think about how it combines with -r (verify)
  • --files-from=FILE - use a FILE for the exact list of files to transfer (can be - for stdin)


  • -d or --dirs - transfer specified directories without recursing. (unless you specify a directory that either is exactly "." or ends with a slash). Some other options imply this. (verify)
  • -x or --one-file-system - while recursing, do not cross filesystem boundaries (mount points)
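A sketch of --files-from fed from find (scratch paths invented for the example); note that listed paths are taken relative to the source argument:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/src/keep" "$tmp/src/skip" "$tmp/dst"
echo 1 > "$tmp/src/keep/a.jpg"
echo 2 > "$tmp/src/skip/b.dat"

# Transfer exactly the files find prints, recreating their directories
cd "$tmp/src"
find . -name '*.jpg' | rsync -a --files-from=- . "$tmp/dst/"
```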



Path filters/rules are useful when you want to select a subset of the files in the source tree, and can describe it well using some patterns.

You specify rules like -f"RULE" or --filter="RULE", where a rule consists of a type and an argument.

The first matching filter applies, so order matters (short circuits).


There are nine different rule types, but you'll probably just use the two most common:

-f "+ pattern" for 'want'
-f "- pattern" for 'do not want'.


Pattern notes

Notes:

  • you can use shell-like globs, *, ?, and things like [135]
    • * matching stops at slashes (** does not)
    • ? matches any single character except a slash
    • [] is regexp-like (including some class names (verify))
  • on slashes:
    • starting with a / anchors the match to the root of the transfer (unless it is a per-directory rule)
    • ending with a / will match a directory's name
    • a slash elsewhere will match it against the full pathname (not just under the root either)


Examples

"copy everything except .rec files"

-f"- *.rec"  

"Copy only directory structure, no files" (remember both the short circuiting, and the slash logic)

-f"+ */"  -f"- *" 

"Copy the directory structure and any .jpg files in it":

-f"+ */"  -f"+ *.jpg"  -f"- *"

Note: without the + */ it wouldn't select the directory structure itself.

"Don't copy .svn directories": (note: -C / --cvs-exclude understands svn and various others)

-f"- .svn/"

Or perhaps:

-f"- **/.svn"


files-from

What to do at the destination side, and related sync semantics

  • -a, --archive - shorthand for -rtpgolD, approximately "recursive copy and preserve most metadata":


-r - recursive (without -r you'll see "skipping directory" for each directory; see also the section on filters)
-t - preserve modification times
-p - try to preserve permissions (as in rwxrwxrwx details. Without this you get the receiving side's umask/ACL policy(verify))
-g (or --group) - preserve group
-o - preserve owner
-l - try to preserve symlinks
-D - preserve devices/specials (equivalent to combination of --devices and --specials)


This excludes a few other things you may care about:

  • -A - update ACLs
  • -X - update xattrs
  • -H - try to hardlink as on the source (default is treating them as separate files. In part because finding hardlinks is expensive)
  • -E - preserve executability (presumably for when you don't use -p(verify))
  • -S - handle sparse files efficiently if possible


More tweaking of the synchronization logic, and actions around it:

  • -R or --relative - controls how the target directory structure should be used based on the sender's command/directory(verify)
  • --force - force replacements of non-empty directories (e.g. to be able to replace a directory with a same-named file)
  • -O or --omit-dir-times - omit directories from -t/--times


  • --delete
    • receiver deletes files that do not exist on the sending side
      • One typo can mean a lot of data is gone, so you probably want to dry-run this first
      • Only deletes in directories that are being considered for transfer
      • There are further options related to this, e.g. combination with excluded files
    • Good for unidirectional archival copies that you want to keep exactly like their masters - typically you're saying "I don't care about any extra data in the target, I want it to look exactly like the source"
    • May be okay for incremental backups, in that it cleans things up, but think about the limitations
    • Bad for centralized syncing (rarely the behaviour you'd want there)
    • one nice side effect is that it cleans up stale rsync tempfiles


  • --remove-source-files removes files after transfer, i.e. makes it a move rather than a copy (but think about what you do with this one)


  • -W / --whole-file - instead of the default diffing-update transfer cleverness, just copy the entire file
    • can be faster when the network is not the bottleneck. (It also makes more sense when source and destination are both local filesystems - but whole-file copies are default behaviour for that case anyway)
  • --append - assume data so far is right, continue after it. implies --inplace
    • Useful for very large files

Network options

Related to the (network) transfer:

  • --bwlimit=KBPS - limit the speed at which we send data, often to avoid saturating your outgoing bandwidth. I occasionally find this useful when omitting it means sluggish shell access.
  • -z or --compress - compress file data. Useful for compressible data over slow/limited connections (less so over very fast local ones, as you may lose more time compressing/decompressing than the transfer would have taken)
  • --skip-compress=LIST - specify file suffixes that should not be compressed even when you use -z.
The default is gz/zip/z/rpm/deb/iso/bz2/t[gb]z/7z/mp[34]/mov/avi/ogg/jpg/jpeg.
You could specify a longer list as applicable to your content (e.g. gz/bz2/t[gb]z/tbz2/z/rpm/deb/zip/rar/7z/avi/mpg/mpeg/mp3/mp4/mkv/ogg/wmv/wma/mov/jpg/jpeg/png/raw/cr2/dng)

Errors


unexplained error / timeout in send/receive

rsync: connection unexpectedly closed (3324523 bytes received so far) 
rsync error: unexplained error (code 255) at io.c(463)

...and if I use --timeout, it seems to consistently become:

rsync error: timeout in send/receive (code 30) at io.c(171)

Network related, apparently specifically ssh-related.


If bytes received/sent is 0, it may be an auth problem.

If the command worked before you moved it to cron, then it may be related to ssh keys (being in the wrong account - user crontabs may make life simpler).

If bytes received/sent is largeish, then it's likelier to be an intermittent connectivity problem.

protocol version mismatch — is your shell clean?

This usually means your shell startup (~/.bashrc, ~/.bash_profile, ~/.ssh/rc, or whatnot) is printing things (on stdout).

That is a problem for anything that automates on top of a shell login, as rsync does. Make the startup quiet (you can make such output conditional on the shell being interactive).

Sometimes that output may be specific to the user you're logging in as.

failed to set times on “/a/path”: Operation not permitted

rsync: mkstemp failed: Operation not permitted (1)

error in rsync protocol data stream (code 12) at io.c(820)

This seems to include various possible errors on the receiving side, from exceeding quota, to a filesystem issue, to ssh rejecting the key (verify)


Further notes

On spaces in paths

Paths with spaces will easily be confusing.

Two solutions:

  • escape the whitespaces in a way that the remote shell understands. For example, in:
rsync 'example.com:/path/with\ spaces/' /local/path/
the quotes are so that our local shell sees that thing as one argument
the backslashes will be sent to, and interpreted by, the remote end
  • use --protect-args (-s), which sends argument values and path specs to the remote end and tells it not to interpret them.
This is useful for arguments that contain spaces, wildcards, and such. Wildcards will still be expanded on the remote end -- but by rsync, not the shell.


non-*nix filesystems (e.g. NTFS, FAT)

  • permissions will never match (because the permission models differ), so if you use -p (and potentially similar for -o, -g), e.g. via -a, you'll see every file reported as changed on each run
you can avoid that with:
--no-o --no-g --no-p
(--no-SOMETHING cancels out single implied options (e.g. from -a's -rtpgolD))
  • timestamp resolution may cause files to be selected unduly. Consider making the comparison fuzzy:
NTFS: small differences, so a --modify-window=1 is enough (and overkill)
FAT: accurate to 2 seconds, so --modify-window=2 may be necessary, though 1 is often enough (verify)


  • note that NTFS does support hardlinks. FAT does not
if you want hardlinks, use -H (rsync in general does not preserve hardlinks by default, because finding them is expensive)
  • Consider what you want to happen around symlinks. See e.g. the SYMBOLIC LINKS header in the man page

Listing which files are different

If you want to inspect which files have probably changed, this is mostly covered by the default test for "what will be selected for transfer".


In other words, an itemized dry run (-n -i) is a pretty informative answer to "what would I add/update from src to dest"

...to make all files present on src the same on dest. To also report which dest-side additions would be removed, add --delete to that dry run. Warning: only remove the -n/--dry-run after a sanity check.
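A small sketch of that, on scratch directories (note the stale file survives the dry run):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/dst"
echo "keep" > "$tmp/src/keep.txt"; cp "$tmp/src/keep.txt" "$tmp/dst/"
echo "old"  > "$tmp/dst/stale.txt"

# Itemized dry run: reports updates AND what --delete would remove
rsync -ain --delete "$tmp/src/" "$tmp/dst/"
```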

On ssh, rsh, and the daemon

A single colon implies connecting via remote-shell access. Historically this was rsh, nowadays it is typically ssh.

For example:

rsync remoteuser@remotehost:/remote/dir localcopy


A double colon, or a path that starts with rsync://, says you want to connect to an rsync daemon. Setting rsync up as a daemon may copy a little faster (no encryption overhead) but it is typically less convenient: you have to figure out authentication and firewalling yourself.

For example:

rsync localpath/ remoteuser@remotehost::/remote/dir
rsync localpath/ rsync://remoteuser@remotehost/remote/dir


Notes:

  • In all cases, the rsync binary needs to be present on both sides.
  • To specify a remote shell, use -e or --rsh (or by setting the RSYNC_RSH environment variable, and/or further trickery via RSYNC_CONNECT_PROG and such), such as:
rsync -e ssh remoteuser@remotehost:/remote/dir localcopy
# or (non-default port)
rsync -e 'ssh -p 2222' remoteuser@remotehost:/remote/dir localcopy
# or even (for speed on LANs, see SSH#On_high-bandwidth_networks)
rsync -e 'ssh -c arcfour -o Compression=no -x' remoteuser@remotehost:/remote/dir localcopy


On continuing files (partial, inplace, temporary files and more)


tl;dr:

  • default is safe, and slow
  • add --partial --partial-dir=something to continue from a gracefully broken off copy
  • use --inplace when you care about being able to continue the transfer of individual files (because they are large)
least IO, but be aware other programs looking at the file will see intermediate nonsense state


When network speed is noticeably slower than disk speed, check-and-continue saves time and network use, because verifying existing file content happens at disk speed.


Roughly:

  • default behaviour is to write a new file (temporary file), then swap it in. This is the default because it is the safest behaviour
When rsync continues from a partial file, it copies content from the partial file into a new temporary file.
This can be a lot of IO work, which is easily avoidable when you don't need the replacement behaviour.
Gracefully breaking off a transfer means temporary files are thrown away (!).
Breaking off non-gracefully means you've got a hidden file that is never looked at again, taking up space
(continuing only happens from the file with its final name, not from a leftover temporary file)


  • --partial
On graceful breakage, an unfinished temporary file is moved to the target path.
if it wasn't there, that allows for a continue (inplace or not)
if it was there, you've effectively truncated it
(potentially throwing away a lot of data that may have been identical)
  • --partial with --partial-dir
on graceful breakage, copies to a file in a specific subdir
which is considered for continue if the next run specifies the same --partial-dir


  • --inplace
writes/updates the target file directly (instead of making a temporary file)
compared to the default, this is faster for large files (the least IO)
(as the default is to create a complete copy and then remove the old one)
but don't use the file while it is transferring


Terminology:

  • temporary file: a dotfile that rsync writes,
to eventually swap in over the file it is replacing
This is safe behaviour, hence the default.
(if the target existed and was mostly correct, this is mostly a disk-speed copy operation)
  • Partial file is the final name, the path that a finished file will be renamed/moved to - which may exist already.
called this because rsync will only consider check-and-continue transfer when it has a partial file to start from.
it will not continue from a temporary file that wasn't cleaned up. (...but you could choose to rename it, to save some time)
(There is one case where the partial file is distinct from the file's final destination: if you use --partial and --partial-dir, the partial file will be a separate file written on graceful break-off)


Another important distinction is how rsync can be broken off.

It can be broken off gracefully/gently, which includes:

  • network timeout by rsync (if you have set a timeout -- by default rsync waits indefinitely)
  • killing the receiving process/daemon (verify)
  • Ctrl-C when we send data elsewhere (verify)

It can also be killed non-gently, which includes

  • Ctrl-C if copying from a remote host? (verify)


On timeout

While it is useful to specify a timeout rather than wait indefinitely, you often still want a large value to be safe.

The test seems to be "time without transferred content", not "time spent idling", so hard pre-transfer work (e.g. for large matching files) will count towards the timeout too.


Note: if you use --delete and --timeout, you may want --delete-during behaviour rather than the default --delete-before.



On speed

Encryption overhead

See the arcfour example in the ssh/daemon section above.

tl;dr: you can often gain some speed using -e 'ssh -c arcfour' (note that recent OpenSSH versions have removed arcfour, so pick the fastest cipher both ends actually support)



Parallel rsync

If speeds are limited by an older rsync with small buffers (and a higher-latency remote host), then running multiple transfers in parallel will get you more total speed.

I do this when I can easily make filter-based rules to split things up.

In theory there are some xargs tricks that can make that easier. (TODO: play with that)


Small files first

You cannot sort by size (unless you do it externally yourself and hand rsync the sorted list), but you can e.g. do a few incremental passes like:

--max-size=50K
--max-size=200K
--max-size=1M

Moving instead of copying

Non-unix filesystems



Possible issues:

  • user cannot write timestamps (e.g. non-root, also depending on how it was mounted)
  • filesystems with lower-resolution timestamps means files are always selected (e.g. FAT)
reduce comparison accuracy. Something like --modify-window=2 should cover most cases
-c (--checksum) is safer but takes longer.
If you know these files don't change, then --size-only may be simpler.
  • lots of errors trying to preserve permissions, owners, links, special files, timestamps (e.g. FAT has only one timestamp)
avoid -a if you want to avoid these. Exactly which options to keep varies per filesystem

Related and similar software


Other utilities also use the rsync algorithm (often using librsync) and some more complex incremental sync / backup systems. These include

  • rdiff
  • rsnapshot - useful for backups, where it can save a lot of space by using hardlinks when files are identical.


  • Unison - rsync-like, but bidirectional
  • syncrify - reimplementation with some rsync logic in it

rsync on windows

Note:

pretty much all real options use cygwin as a basis,
which implies windows paths need to be translated to unix-style paths, using cygwin's mapping of drives, e.g.
C:\path\to\file

becomes

/cygdrive/c/path/to/file


cygwin itself

If you already have a cygwin install, you can install rsync in it.

If you don't, cygwin is probably overkill (and a bit of a learning curve), and you can use one of the following:


cwRsync

This packages rsync with just enough of cygwin to work. (the free cwRsync version is just that, the paid version adds a GUI)

Personally, I ended up using this one, with a batch file containing something like:

rsync -r -t -v --progress -e "C:\cwRsync\ssh.exe -p 22" user@host:/data/Docs/ /cygdrive/d/Docs
pause


Grsync

minimal-rsync-via-cygwin plus a GTK interface (win+lin+bsd).


The cygwin stuff is more transparent, with the path rewrite done for you.

When it works it's nice, but for some it freezes frequently, and its interaction with the underlying rsync.exe seems a bit iffy.

Additional options you may want:

-e "ssh -p 22"



DeltaCopy

Wraps rsync and cygwin, adds its own GUI (e.g. includes a scheduler).

Client-server style, uses the rsync protocol/port.


See also