Archive and backup notes

From Helpful
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Stuff vaguely related to storing, hosting, and transferring files and media:


'Archive' here typically means 'a collection of files serialized to a single file'. The "best way to ensure a copy survives" sense is a topic in itself.



Tar files (also known as tarballs) store uncompressed collections of files, with basic metadata. You can hand tar both files and directories.

compress a directory (and implicitly its contents), and have it report what files it's adding

tar czvf things.tar.gz dir/

decompress that to the now-current directory

tar xf things.tar.gz

list files in archive

tar tf things.tar.gz
(note: these are ordinary options, equivalent to -c -z -v -f, -x -f, and -t -f respectively; for historical reasons the minus can be omitted on the clumped form)


  • c means 'create archive'
  • x means 'extract archive'
  • t means 'list archive contents'
  • v for verbose, "list files you're adding / extracting", useful e.g. to see tarfile creation is taking in the right files
  • f means 'archive filename follows as next argument'

  • z for gzip (for a .tar.gz / .tgz)
  • j for bzip2 (for a .tar.bz2 / .tbz2) - not available in older versions
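
The create, list and extract commands above can be tried end to end in a scratch directory. This is only a sketch: the directory layout and file names are made up for the demo, and -C (change directory before acting) is an addition that keeps the demo away from your current directory.

```shell
# Scratch area; every name below is hypothetical demo data.
work=$(mktemp -d)
mkdir -p "$work/dir/sub" "$work/out"
echo hello > "$work/dir/a.txt"
echo world > "$work/dir/sub/b.txt"

# Create a gzip-compressed archive; -C makes the stored paths relative to $work.
tar -czf "$work/things.tar.gz" -C "$work" dir/

# List the member names (no z needed when reading, compression is detected).
listing=$(tar -tf "$work/things.tar.gz")
printf '%s\n' "$listing"

# Extract into a different directory instead of the current one.
tar -xf "$work/things.tar.gz" -C "$work/out"
```

Listing before extracting is a cheap way to see whether an archive will unpack into its own directory or scatter files around.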

On compression:

  • newer versions of tar detect gzip (and bzip2), meaning you can omit z (or j) when reading archives
  • for larger files when compression is not necessary, it's faster to skip it
    because tar has no index, listing means walking the whole file: on an uncompressed archive tar can seek from header to header, which is fast, but on a compressed one it has to decompress everything

On paths and permissions

On the same host, preserving UID/username, GID/groupname, and permission bits can be very handy, particularly when restoring a backup.

Across hosts, there is usually no meaning/point to this, and worst-case it is a security problem due to permission bits.

As such, when extracting you may want tar to set the UID and GID of the unpacking user, instead of those stored in the archive.

There are a few related options such as (not) applying umask.
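
The notes above do not name the options involved. In GNU tar, --no-same-owner and --no-same-permissions look like the ones being described; that pairing is an assumption here, not something the text states. A minimal sketch with made-up file names:

```shell
# Scratch area with a deliberately world-writable file (hypothetical demo data).
work=$(mktemp -d)
mkdir -p "$work/src" "$work/restore"
echo data > "$work/src/file.txt"
chmod 666 "$work/src/file.txt"
tar -cf "$work/perm.tar" -C "$work" src

# Extract as "whoever runs this", with the umask applied to the stored mode.
# (Assumed options: --no-same-owner, --no-same-permissions.)
(
    umask 077
    tar -xf "$work/perm.tar" -C "$work/restore" --no-same-owner --no-same-permissions
)
mode=$(ls -l "$work/restore/src/file.txt" | cut -c1-10)
echo "restored mode: $mode"
```

For ordinary users GNU tar behaves like this by default; the explicit options mostly matter when extracting as root.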

Filtering what to add

e.g. in "Make a backup of my website code, but exclude that huge image directory:"

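tar czvf ~/backup/code-$(date +%F).tgz --exclude 'img/*' .


  • the pattern is a glob, and you want tar to interpret it, not the shell, hence the single quotes.
  • --exclude is given before the path, because some GNU tar versions reportedly only apply it to names that follow it on the command line
  • here meant for a daily file in a cronjob, as
    date +%F
    outputs something like 2017-04-22
    (if the command sits directly in a crontab line, write the percent sign as \% because cron treats a bare % specially)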
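
To see what an exclude pattern actually leaves out, it helps to list the result. A small sketch, assuming GNU tar and a made-up site layout:

```shell
# Hypothetical website tree with one "huge" image directory.
work=$(mktemp -d)
mkdir -p "$work/site/img" "$work/site/lib"
echo code > "$work/site/index.php"
echo code > "$work/site/lib/util.php"
echo big  > "$work/site/img/photo.jpg"

# Quote the pattern so tar, not the shell, expands it; keep it before the path.
tar -czf "$work/code.tgz" -C "$work/site" --exclude='img/*' .

# Check what ended up in the archive.
members=$(tar -tzf "$work/code.tgz")
printf '%s\n' "$members"
```

The img directory entry itself is still stored; only its contents match the pattern.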

For fancier filtering, you probably want find (or some shell-fu) to produce the list of names, and hand that list to tar as "include only the files listed in this file" (-T below, here read from stdin to avoid a temporary file). There is also an "exclude the files listed in this file" variant (-X).

A practical example:

find ~/public_html/  ! -type d  ! -wholename '*/img/*'  -print0 | \
     tar --null -T - -cvjf ~/backup/code-`date +%F`.tar.bz2

...which you can put in your crontab to have a simple automated backup.


  • Read
    ! -wholename '*/img/*'
    as 'the path, as find prints it, does not contain /img/' (-wholename is a less portable GNU synonym for -path)
  • ! -type d
    keeps directories out of the list, because tar recurses into any directory you hand it, which would pull the excluded files back in (this includes the starting directory itself, e.g. ".", which find always lists)
    Yes, --no-recursion on tar is also possible, and sometimes much simpler
  • For robustness against filenames with special characters, find outputs (-print0) and tar expects (--null) NUL characters as delimiters between filenames.
    You could also use grep/egrep in this filename pipe, as long as you use -z (or the longer -azZ) to make it expect and produce nulls.
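
The same pipeline on throwaway data, to show that names with spaces survive and that the filtered directory stays out. This is a sketch assuming GNU find and GNU tar; it uses gzip rather than bzip2 only so that it runs where bzip2 is not installed, and all paths are made up.

```shell
# Hypothetical tree with an awkward file name and an img directory to skip.
work=$(mktemp -d)
mkdir -p "$work/public_html/img" "$work/public_html/my docs"
echo a > "$work/public_html/index.html"
echo b > "$work/public_html/my docs/read me.txt"
echo c > "$work/public_html/img/big.png"

# NUL-delimited names from find, read by tar from stdin (-T -).
(
    cd "$work" || exit 1
    find public_html ! -type d ! -path '*/img/*' -print0 |
        tar --null -T - -czf "$work/filtered.tgz"
)
filtered=$(tar -tzf "$work/filtered.tgz")
printf '%s\n' "$filtered"
```

Because only non-directories are listed, empty directories and directory metadata are not stored in the result.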

More on GNU tar
  • --append
    (instead of -c for create)
    appends a (metadata,data) thing at the end of an existing tar file
    can introduce entries with duplicate names
  • --update
    compares entries to disk, then --appends files that have a newer timestamp
    can introduce entries with duplicate names

  • --delete
    removes named entries
    will rewrite the archive, so don't do this on actual tape, and avoid with huge archives

Since there can be entries with duplicate names (see append, update)...

  • by default, extract does things in order, so will extract all, and silently overwrite the one(s) that existed / were extracted earlier
  • you can extract only a specific one, via --occurrence=number
  • you can also --delete a specific --occurrence
    but that will rewrite the archive to do so (slow on larger files, and doesn't work on compressed tars)
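
A small demonstration of how duplicates come about and how --occurrence picks one. This is a sketch assuming GNU tar (other tar implementations may lack --occurrence); note.txt and dup.tar are made-up names.

```shell
work=$(mktemp -d)

# First version goes in with create, second with append (-r is --append).
echo v1 > "$work/note.txt"
tar -cf "$work/dup.tar" -C "$work" note.txt
echo v2 > "$work/note.txt"
tar -rf "$work/dup.tar" -C "$work" note.txt

# The listing now shows the same name twice.
dup_count=$(tar -tf "$work/dup.tar" | grep -c '^note\.txt$')

# -O writes member data to stdout; --occurrence=1 selects the first copy.
first_version=$(tar -xOf "$work/dup.tar" --occurrence=1 note.txt)
echo "entries: $dup_count, first copy contains: $first_version"
```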


  • --verify
    after a create/append/update's changes finish, reads all of the archive and compares its contents against disk
    reports when (size, mode, owner, modification date, contents) differ
    doesn't combine with compression
  • --compare
    basically a verify that's not fixed to a create/append/update command
    if no arguments are given, it uses all filenames it finds in the archive (verify). If arguments are given it compares only those filenames.
    unlike --verify, reading a compressed archive seems to work here (verify)
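
What --compare looks like in practice, as a sketch assuming GNU tar: it is silent and exits 0 when nothing differs, and prints a line per difference and exits non-zero otherwise. File names are made up.

```shell
work=$(mktemp -d)
mkdir "$work/dir"
echo original > "$work/dir/f.txt"
tar -cf "$work/cmp.tar" -C "$work" dir

# Nothing changed yet: expect silence and exit status 0.
unchanged=no
if tar -df "$work/cmp.tar" -C "$work"; then unchanged=yes; fi

# Change the file on disk, then compare again and keep the report.
echo "changed content" > "$work/dir/f.txt"
changed=no
if ! tar -df "$work/cmp.tar" -C "$work" > "$work/diff.log" 2>&1; then changed=yes; fi
cat "$work/diff.log"
```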


  • tar stores a checksum for each header block but none for file contents, so there are no direct integrity checks of the data
  • you can create a checksum of the contents during creation of the archive
    tar -cvpf bup.tar dir/ | xargs -d '\n' -I '@' sh -c 'test -f "$1" && md5sum "$1"' sh '@' > bup.md5
    ...since tar -v outputs filenames after it's done with them. They'll also be read right after being added, so we'll often read from the page cache instead of disk
    there isn't a simple shell-fu verifier to check the still-archived result. There are things like veritar that can do that.
  • note that decompressing a tar.gz (e.g. to /dev/null) is a good check that the compressed stream was not corrupted by backing storage (not that you can do much about it if it were)
  • It seems that -W/--verify and --remove-files happen in that order (verify), so should work as "delete from disk only once verified to be in the archive" (verify)
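
A crude way to check an archive against such a checksum list is to unpack it somewhere disposable and run md5sum -c there. The sketch below assumes GNU tar, GNU xargs (for -d) and md5sum; the scratch directory and file names are made up, and it costs as much disk space as the archive's contents.

```shell
work=$(mktemp -d)
mkdir -p "$work/dir" "$work/scratch"
echo one > "$work/dir/one.txt"
echo two > "$work/dir/two words.txt"

# Record checksums of the files on disk while archiving.
(
    cd "$work" || exit 1
    tar -cvpf bup.tar dir/ |
        xargs -d '\n' -I '@' sh -c 'if test -f "$1"; then md5sum "$1"; fi' sh '@' > bup.md5
)

# Later: unpack into a scratch directory and verify against the list.
tar -xf "$work/bup.tar" -C "$work/scratch"
( cd "$work/scratch" && md5sum -c "$work/bup.md5" )
```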
tar: /path/to/something: file changed as we read it

(and at the end, a message like "Error exit delayed from previous errors")

tar does a stat before and after it handles a file or directory, and will say this if mtime or size differ.

  • When reported on files, this can easily mean we tarred a half-changed file.
    In theory this can also happen when the tar being created is within the paths included (though it should notice this and say "file is the archive; not dumped")
  • When reported on directories it's less serious, often meaning a file was added or removed since tar started.
    When this is a backup of a system in use, that's typically doing what you'd want
    When you didn't expect this, you may want to inspect the results.
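
In a backup script it can help to tell this warning apart from a real failure. GNU tar documents exit status 1 as "some files differ", which during create includes files that changed while being read, and 2 as a fatal error. The helper name below is made up; this is a sketch, not a complete backup script.

```shell
# Hypothetical helper: turn GNU tar's exit status into a word for a log line.
tar_status_word() {
    case "$1" in
        0) echo ok ;;
        1) echo warning ;;   # some files differed or changed while being read
        *) echo failed ;;    # 2 is a fatal error
    esac
}

# Demo on a tiny tree that nothing else is writing to.
work=$(mktemp -d)
mkdir -p "$work/live"
echo data > "$work/live/file.txt"

status=0
tar -cf "$work/live.tar" -C "$work" live || status=$?
result=$(tar_status_word "$status")
echo "backup finished: $result"
```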

Format details



dar is multiplatform (bsd, lin, win, osx), though not a standard-installed utility.

Designed as an improvement over tar, dar is generally more like zip, but simpler in that it doesn't have thirty years of details to it:

  • per-file compression (gzip, bzip2, lzo, xz, lzma)
  • stores a catalogue, so can list quickly, extract selectively
  • differential backup (changes since last full backup)
  • incremental backup (changes made since the last incremental backup)
  • handles nonregular/dir inodes
  • handles sparse files
  • handles hardlinks
  • handles ACLs, xattr, SELinux context
  • split archives
  • encryption

See also:

command line 7zip

7zip has command-line utilities 7z and 7za (and more)


  • 7za is statically compiled, with just the core formats (which? apparently 7z, zip, gzip, bzip2?(verify))
    is standalone
  • 7z is the pluggable variant, so supports more, but for some formats calls out to other programs.
    more flexible
    you may never need the extra formats (seems things like RAR, CAB, ARJ)

So typically 7z is flexible and what you want, while 7za is faster in some cases.

See also:

zip format notes


Appending to archive / moving files to archive


See Hashing notes#Checksum files

format notes

Backup strategies

File backup notes


Because laziness prevents me from consistently backing up stuff, it's useful to have it automated. These are almost necessarily quite specific, but are probably still useful starting points.

Once these are in scripts, it's easy to run them from cron too.

rsync can be a useful tool, and for text files you can abuse source control software like subversion.

Splitting to control file size

You can combine tar and split to generate a set of files not crossing some size limit - and get a constant-sized output regardless of input files.

The following example assumes you also want compression (and during rather than after)

tar czf - /etc | split --bytes=1G --numeric-suffixes - data_backup_1gchunks.tgz_

In my case the output fits into data_backup_1gchunks.tgz_00 and data_backup_1gchunks.tgz_01

Since tar doesn't know about this, reading this out would be a bit more manual too, something like:

cat data_backup_1gchunks.tgz_* | tar xzf -

...assuming bash, which sorts expanded filenames alphabetically (ls does this too, at least by default). POSIX asks the same of glob expansion in other shells. I wouldn't assume this on windows.
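
The split and reassemble round trip can be checked on small data first. A sketch assuming GNU split (for --numeric-suffixes) and gzip; the chunk size and file names are made up, and random data is used only because it does not compress, so the stream reliably spans several chunks.

```shell
work=$(mktemp -d)
mkdir -p "$work/data" "$work/restore"
head -c 300000 /dev/urandom > "$work/data/blob.bin"

# Compress while archiving, cut the stream into 100 KiB pieces.
tar -czf - -C "$work" data |
    split --bytes=100K --numeric-suffixes - "$work/chunk.tgz_"
chunk_count=$(ls "$work"/chunk.tgz_* | wc -l)

# Glue the pieces back together in glob order and unpack.
cat "$work"/chunk.tgz_* | tar -xzf - -C "$work/restore"

# cmp exits 0 only when both files are byte-for-byte identical.
cmp "$work/data/blob.bin" "$work/restore/data/blob.bin" && echo "round trip ok ($chunk_count chunks)"
```

Losing any single chunk makes everything after it unreadable, since the pieces are one compressed stream.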

Tree to backup file



Once upon a time I had the following in a cronjob:

tar -cvjf /backup_directory/codeproject_`date +%F`.tar.bz2 /path/to/codeproject
# parallelized variant (here using gzip for speed)
tar cvf - /path/to/dataset | pigz -3 > dataset_`date +%F`.tgz

A variation with some exclusions:

find /path/to/codeproject ! -type d ! -wholename '*/img/*' -print0 | \
    tar --null -T - -cvjf /backup_directory/codeproject_`date +%F`.tar.bz2


  • date +%F produces something like 2009-12-29, which is a quick way to do unique-per-day and nicely sortable filenames.
  • The
    ! -type d
    in the second example will tell find to not print directory names, which is necessary because tar recurses any directory you give it (which would nullify the effect of the
    ! -wholename
    test)
  • if you use find, it's handy to pipe the null-delimited filename list (for filename safety) directly to tar (using -T - on tar).

(See also the find tricks on find and xargs, and remember that find's syntax is a little unusual)

For code, the easier setup (both more functional and more easily networked) is some versioning system; see subversion, git, and such.


Since you can abuse ssh as a data pipe, you can push backups elsewhere with little work:

tar cjvf - /path/to/codeproject  |  ssh user@backuphost "cat > mybackup.tbz2"
# or, for speed, using parallel compression
# (older setups also picked a cheap cipher with -c arcfour, which recent OpenSSH versions no longer offer)
tar cvf - /path/to/dataset | pigz -3 |  ssh user@backuphost "cat > mybackup.tgz"

To help estimate whether compressing with pigz is worth it, try something like:

tar cvf - . 2>/dev/null | pv -c -N RAW | pigz -3 - | pv -c -N COMP > /dev/null
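
Where pv and pigz are not installed, a rougher version of the same estimate is to compare byte counts with and without compression. This sketch uses plain gzip and made-up, highly repetitive data, so the ratio it prints says nothing about your own files; it measures size only, not speed.

```shell
work=$(mktemp -d)
mkdir -p "$work/data"
# Repetitive text as stand-in data; real data will compress differently.
awk 'BEGIN { for (i = 0; i < 20000; i++) print "some fairly repetitive log line", i }' \
    > "$work/data/log.txt"

# Size of the raw tar stream versus the same stream through gzip -3.
raw_bytes=$(tar -cf - -C "$work" data | wc -c)
comp_bytes=$(tar -cf - -C "$work" data | gzip -3 | wc -c)
echo "raw: $raw_bytes bytes, gzip -3: $comp_bytes bytes"
```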

(...though if you want this in a cronjob you'll likely want an SSH keypair set up to avoid password prompts in the process. Also consider effective-user details, su with a shell, and that sort of detail)


For MySQL, see MySQL_notes#Backup.2FReplication

For postgres, see Postgres#Replication.2C_Backup

See also


Application backup notes

Profile backup notes

Disk imaging notes

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

See also:

Clonezilla notes

You may want a recent development version, because what some download sources consider a 'stable' version can be so old that it doesn't boot on modern hardware (I got a kernel oops).

While clonezilla is smart about copying only used space (and compressing it), restoring needs a disk/partition no smaller than the original you copied.

Note that restoring specific partitions (rather than disks) is more useful for specific-disk backup than it is for free-form duplication, because a partition restore expects the same partition layout, or at the very least the same partition references (names).

See also