Archive and backup notes

From Helpful
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Stuff vaguely related to storing, hosting, and transferring files and media:

Notes on...

Archives here typically means 'a collection of files serialized to a single file'. The "best way to ensure a copy survives" sense is a topic in itself.

tar


Tar files (also known as tarballs) store uncompressed collections of files, with basic metadata -- though in practice we usually add compression


Some example commands

# create uncompressed archive for a directory (v = report what files it's adding)
tar cvf things.tar dir/

# create compressed archive for a directory (v = report what files it's adding, z=gzip)
tar czvf things.tar.gz dir/

# list filenames in archive
tar tf things.tar.gz

# decompress either archive, into the current working directory
#   (which can make a mess if there is no directory structure in the tar, so probably do a list first)
tar xf things.tar.gz


Where:

  • c means 'create archive'
  • x means 'extract archive'
  • v for verbose, "list files you're adding / extracting", useful e.g. to see tarfile creation is taking in the right files
  • f means 'archive filename follows as next argument'


  • z for gzip (for a .tar.gz / .tgz)
  • j for bzip2 (for a .tar.bz2 / .tbz2) - not available in older versions

Clumping those letters together like czvf is allowed for historical reasons. A slightly more modern-style equivalent would be -c -z -v -f
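A hedged sketch of that dashed style, also showing GNU tar's -C option to extract somewhere other than the current directory (things.tar.gz, dir/, and target/ are placeholder names):

```shell
# same as 'tar czvf things.tar.gz dir/', just with separate dashed options
tar -c -z -v -f things.tar.gz dir/

# extract into a specific directory instead of the current one
mkdir -p target
tar -x -v -f things.tar.gz -C target
```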



History / moving goals

tar was classically used to store file+directory onto block storage, such as tape (tar = tape archive), and some of its features come fairly directly from wishes at the time, like quick seeking through those blocks.

Its basic goals are roughly

  • storing file metadata (name and more)
  • storing file data
  • use a block size that makes seeking around easier, which was useful for linear media
  • allow us to append to existing storage without a bunch of rewriting


These days our wishes are simpler, mostly 'take directory tree and pack it into a single archive file'.

The purpose is like zip's, but the format is older and simpler, two reasons it is supported basically everywhere.



compressed tar

Yes, you can create a tar file and then feed the resulting file into a compressor. That's all a .tar.gz or .tar.bz2 or similar is, in fact.

You can save the disk IO of writing an intermediate file by feeding the tar stream directly into a compressor. (The more compatible way is to do that via a pipe. Also, various tar implementations have, over time, added parameters to compress themselves: commonly for gzip, regularly for bzip2, sometimes for xz or lzip)

So the following three commands should all produce identical files:

tar cvf test.tar /test/ && bzip2 test.tar
tar cvf - /test/ | bzip2 > test.tar.bz2
tar cvjf test.tar.bz2 /test/
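One way to convince yourself of that equivalence (a sketch, using a placeholder dir/ instead of /test/): decompress each result back to a raw tar stream and compare it byte-for-byte against the plain tar.

```shell
# make the plain and the separately-compressed versions (-k keeps test.tar around)
tar cf test.tar dir/ && bzip2 -k test.tar

# make the piped version
tar cf - dir/ | bzip2 > piped.tar.bz2

# both should decompress to the same byte stream as the plain tar
bzip2 -dc test.tar.bz2  | cmp - test.tar && echo "same"
bzip2 -dc piped.tar.bz2 | cmp - test.tar && echo "same"
```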

Newer versions of tar detect when files they read are in a supported compressed format, meaning you can omit z (or j) when reading archives (e.g. with t, x)


compressed tars are easy for distribution, but clunky for alteration or even listing

Classically no one wanted to compress the tar data, because tar was meant to create the bytestream to be written directly to tape, and that tar structure could be used to seek through tape. (If you wanted compression, you might compress individual files).


These days, the typical use of tar is to create a single file that unpacks to a directory, so you can save some transfer if you compress the tarball.


But tar itself does not know about compression, and the compression used does not know about tar. A compressed tarball is just the result of feeding that finished tar file into a compressor.

This is almost trivial to implement, and when extracting you know you want to read everything, from start to finish, and process the stream as you go, which is efficient.


...but it has drawbacks for most any other operation.

In particular, since you do not have quick read or write access into the middle of the underlying tar stream, you cannot quickly do tar operations like appending or removing data, or even listing.

Listing needs to go through the file, so implies uncompressing everything.
Appending or removing basically comes down to rewriting the entire tar stream.

...this is

unlike uncompressed tars, which you can seek through directly (only somewhat faster on tape, which is linear-access, but pretty quick on our fancy random-access hard drives)
unlike most other archive formats, which tend to have one or a few file indices they can read instead of the entire file, and which think in terms of adding compressed chunks.

So when most files you add aren't compressible, you may as well skip compressing the tarball.



On UID/GID/permissions
Filtering what to add

e.g. in "Make a backup of my website code, but exclude that huge image directory:"

tar czvf ~/backup/code-$(date +%F).tgz . --exclude 'img/*'

Notes:

  • globs are used, so you probably want to have tar interpret them, not the shell, hence the single quotes.
  • here meant for a daily file in a cronjob, as date +%F outputs something like 2017-04-22
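If you want several exclusions, --exclude can simply be repeated (a sketch; the patterns here are made-up examples):

```shell
# skip both the image directory and any logfiles
# (putting options before the paths is the more portable habit)
tar --exclude='img/*' --exclude='*.log' -czvf backup.tgz .
```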


For fancier filtering, you probably want find (or some shell-fu) to list things, used as an "include only files listed in this file" (-T below, here piped in on stdin to avoid a temporary file) (or sometimes "exclude files listed in this file").

A practical example:

 find ~/public_html/  ! -type d  ! -wholename '*/img/*'  -print0 | \
     tar -T- --null -cvjf ~/backup/code-`date +%F`.tar.bz2

...which you can put in your crontab to have a simple automated backup.


Notes:

  • Read ! -wholename '*/img/*' as 'absolute path does not contain /img/'
  • ! -type d is there to prevent tar's behaviour of recursing into directories you hand it (which would include ".", something find outputs easily)
Yes, --no-recursion on tar is also possible, and sometimes much simpler
  • For robustness against filenames with special characters, find outputs (-print0) and tar expects (--null) as delimiters between filenames.
You could also use grep/egrep in this filename pipe, as long as you use -azZ to make it expect and produce nulls.
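For example, a sketch using GNU grep's -z (which makes grep treat both its input and output as null-delimited) to drop logfiles from the list before tar sees it; src/ and the pattern are placeholders:

```shell
# find emits null-delimited names, grep -z filters them, tar --null reads them
find src -type f -print0 | grep -z -v '\.log$' | tar --null -T - -cvf code.tar
```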




Tar warnings and errors

tar: /path/to/something: file changed as we read it

And, at the end, an "Error exit delayed from previous errors".

tar does a stat() before and after it handles a file or directory, and if mtime or size differ, it will say this. That is good behaviour, since...
  • When reported on files, this can easily mean we tarred a half-changed file.
In theory this can also happen when the tar being created is within the paths included (though it should notice this and say "file is the archive; not dumped")
  • When reported on directories it's less serious, often meaning a file was added or removed since tar started.
When this is a backup of a system in use, that's typically doing what you'd want
When you didn't expect this, you may want to inspect the results.


tar: Removing leading `/' from member names

Consider that if a random archive from somewhere happened to include paths starting with /bin/ it would be a little too easy to overwrite, you know, many system files.

This is not something you want unless you are very sure you want it (good luck unborking that, or removing the malware you just installed as root), which is why it takes a little more learning and a few more explicit keystrokes to actually do this.



tar: ./curdir.tar: file is the archive; not dumped

Informational.

If you do something like

tar cf curdir.tar .

then tar will find that curdir.tar file while it's going through files to add.

It knows to skip it, instead of e.g. trying to add the archive to itself.
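A quick way to see this for yourself (a sketch; run it in some scratch directory). The message goes to stderr, and tar still exits successfully:

```shell
# create an archive of the current directory, inside that same directory
tar cf curdir.tar . 2>&1 | grep 'file is the archive'

# the archive is still fine, it just doesn't contain itself
tar tf curdir.tar
```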

Format details

dar



Multiplatform (bsd, lin, win, osx), though not a standard-installed utility.


Designed as an improvement over tar, dar is generally more like zip, but simpler in that it doesn't have thirty years of details to it:

  • per-file compression (gzip, bzip2, lzo, xz, lzma)
  • stores a catalogue, so it can list quickly and extract selectively
  • differential backup (changes since last full backup)
  • incremental backup (changes made since the last incremental backup)
  • handles nonregular/dir inodes
  • handles sparse files
  • handles hardlinks
  • handles ACLs, xattr, SELinux context
  • split archives
  • encryption


See also:

command line 7zip

7zip has command line utilities 7z and 7za (and more)

Differences:

  • 7za is statically compiled and standalone, with just the core formats (which? apparently 7z, zip, gzip, bzip2(verify))
  • 7z is the pluggable variant, so it supports more formats, but for some calls out to other programs.
It is more flexible, though you may never need the extra formats (seems to be things like RAR, CAB, ARJ)

So typically 7z is flexible and what you want, while 7za is faster in some cases.



See also:


zip format notes

wishes

Appending to archive / moving files to archive

Checksums

See Checksum files

format notes

What are you doing it for?

Backup strategies

File backup notes

One-liners

Because laziness prevents me from consistently backing up stuff, it's useful to have it automated. These are almost necessarily quite specific, but are probably still useful starting points.

When in scripts, you can easily use cron too.


rsync can be a useful tool, and for text files you can abuse source control software like subversion.


Splitting to control file size

You can combine tar and split to generate a set of files not crossing some size limit - and get a constant-sized output regardless of input files.

The following example assumes you also want compression (applied while creating the stream, rather than afterwards):


tar cz /etc | split --bytes=1m --numeric-suffixes - data_backup_chunks.tgz_

In my case this fits into data_backup_chunks.tgz_00 and data_backup_chunks.tgz_01

Since tar doesn't know about this splitting, reading it back out is a bit more manual too, something like:

cat data_backup_chunks.tgz_* | tar xzf -

...assuming bash, which sorts expanded filenames alphabetically (ls does this too, at least by default). Other shells may do the same(verify). I wouldn't assume this on windows.
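To sanity-check a split set without actually extracting it, you can reassemble and just list (the chunk names here are placeholders):

```shell
# any missing or corrupted chunk will make gzip/tar complain here
cat chunks.tgz_* | tar tzf -
```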



Tree to backup file


Local

Once upon a time I had the following in a cronjob:

tar -cvjf /backup_directory/codeproject_`date +%F`.tar.bz2 /path/to/codeproject

# parallelized variant (here using gzip for speed)
 tar cvf - /path/to/dataset | pigz -3 > dataset_`date +%F`.tgz

A variation with some exclusions:

find /path/to/codeproject ! -type d ! -wholename '*/img/*' -print0 | \
    tar -T - --null -cvjf /backup_directory/codeproject_`date +%F`.tar.bz2

Notes:

  • date +%F produces something like 2009-12-29, which is a quick way to do unique-per-day and nicely sortable filenames.
  • The ! -type d in the second example will tell find to not print directory names, which is necessary because tar recurses any directory you give it (which would nullify the effect of the ! -wholename filter)
  • if you use find, it's handy to pipe the null-delimited filename list (for filename safety) directly to tar (using -T - on tar).

(See also the find tricks on find and xargs, and remember that find's syntax is a little unusual)


For code, the easier setup (both more functional and more easily networked) is some versioning system; see subversion, git, and such.

Remote

Since you can abuse ssh as a data pipe, you can push backups elsewhere with little work:

tar cjvf - /path/to/codeproject  |  ssh user@backuphost "cat > mybackup.tbz2"

# or, for speed, using parallel compression and a cheaper cipher
# (note: arcfour has since been removed from modern OpenSSH; a fast choice now is e.g. aes128-ctr)
tar cvf - /path/to/dataset | pigz -3 |  ssh -c arcfour user@backuphost "cat > mybackup.tgz"

(...though if you want this in a cronjob you'll likely want an SSH keypair set up to avoid password prompts in the process; also consider effective-user details, su with a shell, and that sort of detail)


To help estimate whether the compression is worth the CPU time, try something like:

tar cvf - . 2>/dev/null | pv -c -N RAW | pigz -3 - | pv -c -N COMP > /dev/null
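If pv isn't installed, a rougher sketch of the same idea just compares byte counts (using plain gzip here, on the assumption its ratio is comparable to pigz's):

```shell
# raw size of the tar stream vs its compressed size
raw=$(tar cf - . 2>/dev/null | wc -c)
comp=$(tar cf - . 2>/dev/null | gzip -3 | wc -c)
echo "raw: $raw bytes, compressed: $comp bytes"
```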



Database

For MySQL, see MySQL_notes#Backup.2FReplication

For postgres, see Postgres#Replication.2C_Backup


Application backup notes

Profile backup notes

Disk imaging notes

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


See also:

Clonezilla notes

You may want a recent development version, because what some download sources consider a 'stable' version can be so old that it doesn't like booting on modern hardware (I got a kernel oops).



While clonezilla is smart about copying only used space (and compressing it), restoring needs a disk/partition no smaller than the original you copied.

Note that restoring specific partitions (rather than whole disks) is more useful for specific-disk backup than for free-form duplication, because partition restores expect the same partition layout, or at the very least the same partition references (names).

See also

Unsorted