Archive and backup notes
Stuff vaguely related to storing, hosting, and transferring files and media.
Practical notes
Archiving versus backup
What are you doing it for?
RAID is not backup
Technical notes
'Archive' here typically means 'a collection of files serialized to a single file'. The "best way to ensure a copy survives" sense is a topic in itself.
tar
Tar files (also known as tarballs) are uncompressed collections of files, with basic metadata -- but we usually add compression
Some example commands
# create uncompressed archive for a directory (v = report what files it's adding)
tar cvf things.tar dir/
# create compressed archive for a directory (v = report what files it's adding, z=gzip)
tar czvf things.tar.gz dir/
# list filenames in archive
tar tf things.tar.gz
# extract either archive into the current working directory
# (which can make a mess if there is no directory structure in the tar, so probably do a list first)
tar xf things.tar.gz
Where:
- c means 'create archive'
- x means 'extract archive'
- v for verbose, "list files you're adding / extracting", useful e.g. to see tarfile creation is taking in the right files
- f means 'archive filename follows as next argument'
- z for gzip (for a .tar.gz / .tgz)
- j for bzip2 (for a .tar.bz2 / .tbz2) - not available in older versions
Clumping those letters together like czvf is allowed for historical reasons. A slightly more modern equivalent style would be -c -z -v -f.
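For readability, GNU tar also accepts long options; assuming GNU tar, the compressed-archive example above could equivalently be written as:

```shell
# same as: tar czvf things.tar.gz dir/
tar --create --gzip --verbose --file=things.tar.gz dir/
```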
History / moving goals
tar was classically used to store file+directory onto block storage, such as tape (tar = tape archive), and some of its features come fairly directly from wishes at the time, like quick seeking through those blocks.
Its basic goals are roughly
- storing file metadata (name and more)
- storing file data
- using a block size that makes seeking around easier, useful for linear media
- allowing appends to existing storage without a bunch of rewriting
Tape has a lot of seek time (the winding), and without any structure you would need to read and interpret all the data if you ever wanted to know where you are - which you do. A tar archive is seekable because the header of each file includes its size in the archive, which on an actual tape made it a lot easier to skip ahead (roughly 'find next block, see what this is').
On disk, the seeking is no longer much of a selling point, and the appending doesn't matter a lot either.
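To make that seekability concrete: in the (u)star format each file is preceded by a 512-byte header, with the name in bytes 0-99 and the size as octal ASCII in bytes 124-135. A sketch of reading the first header of an uncompressed archive (assumes bash, for the base-8 arithmetic):

```shell
# name of the first entry: bytes 0-99, NUL-padded
dd if=things.tar bs=1 count=100 2>/dev/null | tr -d '\0'; echo
# its size: bytes 124-135, octal ASCII
size_oct=$(dd if=things.tar bs=1 skip=124 count=12 2>/dev/null | tr -d '\0 ')
echo $(( 8#$size_oct ))
# the next header starts at 512 + (size rounded up to a multiple of 512),
# which is what lets tar (and tape drives) skip ahead without reading the data
```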
These days our wishes are simpler, mostly 'take directory tree and pack it into a single archive file' - just the file-and-metadata part.
The purpose is like zip, but the format is older and simpler, two reasons it is basically just supported everywhere.
But it doesn't even compress
compressed tar
On tape, if you wanted compression, you would either not compress at all (e.g. for speed), or compress each file before storage (not overly common). Either way, tar wouldn't need to concern itself about it.
Tarballs stored on platter got a different use - packaging a bunch of files to send elsewhere - and compression suddenly was interesting.
You can create a tar file, and then feed the resulting file into a compressor.
That's all a .tar.gz or .tar.bz2 (or similar) is, in fact - two separate steps. And you can do that more efficiently by doing both at the same time, saving a write and a read to disk.
So, for example, the following three commands should all produce identical files:
tar cvf test.tar /test/ && bzip2 test.tar
tar cvf - /test/ | bzip2 > test.tar.bz2
tar cvjf test.tar.bz2 /test/
Newer versions of tar will also detect when files they read are in a (supported) compressed format, meaning that even without z or j parameters, reading archives (e.g. t, x) will work fine.
compressed tars are easy for distribution, but clunky for alteration or even listing
The downside to compressing the entire thing is that the "seek quickly through the archive" property is now defeated: you can't seek within the compression around it (without decompressing the entire thing).
There is one thing you can still make fast: decompress as a stream, hand it to tar as you go, and list or extract the files that tar passes over.
Any other operation (remove, update) comes down to redoing the entire thing.
Even listing needs to decompress the entire thing - most other archive formats avoid that.
This is roughly why compressed tarballs are not common, except for things like software distribution.
On UID/GID/permissions
Filtering what to add
e.g. in "Make a backup of my website code, but exclude that huge image directory:"
tar czvf ~/backup/code-$(date +%F).tgz . --exclude 'img/*'
Notes:
- globs are used, so you probably want to have tar interpret them, not the shell, hence the singlequotes.
- here meant for a daily file in a cronjob, as date +%F outputs something like 2017-04-22
For fancier filtering, you probably want find (or some shell-fu) to list things, as a "include only files listed in this file" (-T below, here piped in on stdin to avoid a temporary file) (or sometimes "exclude files listed in this file").
A practical example:
find ~/public_html/ ! -type d ! -wholename '*/img/*' -print0 | \
tar -T- --null -cvjf ~/backup/code-`date +%F`.tar.bz2
...which you can put in your crontab to have a simple automated backup.
Notes:
- Read ! -wholename '*/img/*' as 'absolute path does not contain /img/'
- ! -type d is there to prevent tar's behaviour of recursing into directories you hand it (find would otherwise also print directories, including '.')
- Yes, --no-recursion on tar is also possible, and sometimes much simpler
- For robustness against filenames with special characters, find outputs (-print0) and tar expects (--null) as delimiters between filenames.
- You could also use grep/egrep in this filename pipe, as long as you use -z (and possibly -a) to make it expect and produce null-delimited records.
Tar warnings and errors
tar: /path/to/something: file changed as we read it
(and, at the end, tar: Error exit delayed from previous errors)
tar does a stat() before and after it handles a file or directory,
- and if mtime or size differ, it will say this, which is good behaviour since...
- When reported on files, this can easily mean we tarred a half-changed file.
- In theory this can also happen when the tar being created is within the paths included (though it should notice this and say file is the archive; not dumped)
- When reported on directories it's less serious, often meaning a file was added or removed since tar started.
- When this is a backup of a system in use, that's typically doing what you'd want
- When you didn't expect this, you may want to inspect the results.
tar: Removing leading `/' from member names
Consider that if a random archive from somewhere happened to include paths starting with /bin/ it would be a little too easy to overwrite, you know, many system files.
This is not something you want (good luck unborking that, or removing the malware you just installed as root) unless you are very sure you do, which is why it takes a little more learning and a few more explicit keystrokes to actually do this.
tar: ./curdir.tar: file is the archive; not dumped
Informational.
If you do something like
tar cf curdir.tar .
then tar will find that curdir.tar file while it's going through files to add.
It knows to skip it, instead of e.g. trying to add the archive to itself.
Format details
dar
Multiplatform (BSD, Linux, Windows, OSX), though not a standard-installed utility.
Designed as an improvement over tar, dar is generally more like zip, but simpler in that it doesn't have thirty years of details to it:
- per-file compression (gzip, bzip2, lzo, xz, lzma)
- stores a catalogue, so it can list quickly and extract selectively
- differential backup (changes since last full backup)
- incremental backup (changes made since the last incremental backup)
- handles nonregular/dir inodes
- handles sparse files
- handles hardlinks
- handles ACLs, xattr, SELinux context
- split archives
- encryption
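A sketch of basic dar use (archive basenames and paths are illustrative; dar writes slices named like full.1.dar):

```shell
# full backup of a tree into an archive with basename 'full'
dar -c full -R /path/to/data -z
# differential backup: only what changed relative to the 'full' archive
dar -c diff1 -R /path/to/data -z -A full
# list contents; extract into a directory
dar -l full
dar -x full -R /restore/here
```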
See also:
command line zip
(the Info-ZIP implementation)
The simplest thing you might care about:
zip -r things.zip things/
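And the usual companions for reading it back (Info-ZIP's unzip):

```shell
# list contents
unzip -l things.zip
# extract into a specific directory
unzip things.zip -d outdir/
```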
command line 7zip
7zip has command line utilities 7z and 7za (and more)
Difference:
- 7za is statically compiled, with just the core formats (which? apparently 7z, zip, gzip, bzip2?(verify))
- is standalone
- 7z is pluggable variant, so supports more but for some formats calls out to other programs.
- more flexible
- you may never need the extra formats (seemingly things like RAR, CAB, ARJ)
So typically
- 7z is flexible and what you want if you use it as a generic tool
- 7za is slightly faster in some cases.
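The basics look the same for either variant (a = add/create, l = list, x = extract keeping paths; note that -o takes its directory without a space):

```shell
7z a things.7z things/     # create/add
7z l things.7z             # list
7z x things.7z -ooutdir    # extract, keeping directory structure
```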
See also:
- https://sevenzip.osdn.jp/chm/cmdline/syntax.htm
- https://sevenzip.osdn.jp/chm/cmdline/commands/index.htm
zip format notes
wishes
Appending to archive / moving files to archive
Checksums
See Checksum files
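For example, with coreutils' sha256sum (excluding the checksum file itself so a re-run doesn't include it):

```shell
# write checksums for all files under the current directory
find . -type f ! -name SHA256SUMS -print0 | xargs -0 sha256sum > SHA256SUMS
# later, verify
sha256sum -c SHA256SUMS
```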
format notes
Backup strategies
File backup notes
One-liners
Because laziness prevents me from consistently backing up stuff, it's useful to have it automated.
Such automation is often quite specific, but the examples here are probably still useful starting points.
When in scripts, you can easily use cron too.
rsync can be a useful tool to throw at a lot of files (Time Machine is conceptually based on something like rsync)
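A sketch of that snapshot idea (directory names are illustrative): each run produces a full-looking tree, but files unchanged since the previous snapshot are hardlinked via --link-dest, so they cost no extra space:

```shell
today=$(date +%F)
rsync -a --delete --link-dest=/backup/latest /path/to/data/ /backup/"$today"/
ln -sfn /backup/"$today" /backup/latest   # point 'latest' at the new snapshot
```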
Splitting to control file size
If you want to avoid single huge files, you could feed the output of e.g. tar into split.
The following example assumes you also want compression (and during, rather than after).
tar czf - /etc | split --bytes=1m --numeric-suffixes - data_backup_1mchunks.tgz_
In my case that fits into data_backup_1mchunks.tgz_00 and data_backup_1mchunks.tgz_01
Since tar doesn't know about this splitting at all, reading this out would be a bit more manual too, something like:
cat data_backup_1mchunks.tgz_* | tar xzf -
(here counting on bash's globbing expansion in lexical order)
Tree to backup file
Local
Once upon a time I had the following in a cronjob:
tar -cvjf /backup_directory/codeproject_`date +%F`.tar.bz2 /path/to/codeproject
# parallelized variant (here using gzip for speed)
tar cvf - /path/to/dataset | pigz -3 > dataset_`date +%F`.tgz
A variation with some exclusions:
find /path/to/codeproject ! -type d ! -wholename '*/img/*' -print0 | \
tar -T - --null -cvjf /backup_directory/codeproject_`date +%F`.tar.bz2
Notes:
- date +%F produces something like 2009-12-29, which is a quick way to do unique-per-day and nicely sortable filenames.
- The ! -type d in the second example will tell find to not print directory names, which is necessary because tar recurses any directory you give it (which would nullify the effect of the ! -wholename filter)
- if you use find, it's handy to pipe the null-delimited filename list (for filename safety) directly to tar (using -T - on tar).
(See also the find tricks on find and xargs, and remember that find's syntax is a little unusual)
For code, the easier setup (both more functional and more easily networked) is some versioning system; see subversion, git, and such.
Remote
Since you can abuse ssh as a data pipe, you can push backups elsewhere with little work:
tar cjvf - /path/to/codeproject | ssh user@backuphost "cat > mybackup.tbz2"
# or, for speed, using parallel compression and a simpler cipher:
tar cvf - /path/to/dataset | pigz -3 | ssh -c arcfour user@backuphost "cat > mybackup.tgz"
(...though if you want this in a cronjob, you'll likely want an SSH keypair set up to avoid password prompts in the process - also consider effective-user details, su with a shell, and that sort of detail)
To help estimate whether pigz is worth the compression, try something like:
tar cvf - . 2>/dev/null | pv -c -N RAW | pigz -3 - | pv -c -N COMP > /dev/null
Database
For MySQL, see MySQL notes#Backup.2FReplication
For postgres, see Postgresql notes#Replication.2C_Backup
Application backup notes
Profile backup notes
Disk imaging notes
📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.
See also:
Clonezilla notes
You may want a recent development version, because what some download sources consider a 'stable' version can be so old that it doesn't boot well on modern hardware (I got a kernel oops).
While clonezilla is smart about copying only used space (and compressing it),
restoring needs a disk/partition no smaller than the original you copied.
Note that restoring specific partitions (rather than whole disks) is more useful for specific-disk backup than for free-form duplication, because partition restores expect the same partition layout, or at the very least the same partition references (names).
See also
Unsorted