Compression notes

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

Stuff vaguely related to storing, hosting, and transferring files and media:


Utilities and file formats (rather than methods and algorithms)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

General purpose


Commonly used, mostly because it is so well known.

There have been a bunch of revisions/additions (see version history), adding an increasing number of underlying compression methods.

Of these compression methods, some are outdated and many are specialized. There is also optional encryption.

This also means not every tool can open every ZIP file.

To avoid compatibility problems, ZIP files meant to be widely readable stick to DEFLATE, which has barely changed in twenty years.

Related formats:

  • Java:
    • JAR (Java Archive) is actually ZIP, with the addition of a few (optional) files, e.g. a manifest. [1]
    • WAR (Web Application aRchive) usually refers to the Sun format that is a JAR used to distribute Java webapps. WAR files have a number of basic files and directories that are expected to be present. [2]
    • EAR (Enterprise ARchive) also serves specific purposes. [3]
  • XPI: Firefox extensions are a standardized set of files within a ZIP file [4]

compression methods

General purpose, little point because basic deflate is usually better:

  • 0: Store (no compression)
  • 1: Shrink
  • 2: Reduce (compression factor 1)
  • 3: Reduce (compression factor 2)
  • 4: Reduce (compression factor 3)
  • 5: Reduce (compression factor 4)
  • 6: Implode

General purpose, more common:

  • 8: Deflate (the one that everything supports)
  • 9: Deflate64 ('Enhanced deflate')
  • 12: bzip2 (since WinZip version 11(verify))
  • 14: LZMA
  • 98: PPMd (since WinZip version 11(verify))

Media-specific compression methods:

  • 96: Compressed lossless JPEG
  • 97: WavPack


  • 10: Old IBM TERSE
  • 18: New IBM TERSE
  • 19: IBM LZ77 z
  • 7: "Reserved for Tokenizing compression algorithm" (?)
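To see which of these methods a given archive actually uses, you can read the method number stored for each member. A minimal sketch using Python's standard zipfile module; the METHOD_NAMES table and the two function names are made up for this illustration, and the table only covers a handful of the IDs listed above:

```python
import io
import zipfile

# A few method IDs from the list above; anything else is reported by number.
# (This table is illustrative, not something zipfile provides.)
METHOD_NAMES = {0: "Store", 8: "Deflate", 9: "Deflate64", 12: "bzip2", 14: "LZMA"}

def list_methods(zip_bytes):
    """Return {member name: method name} for a ZIP file held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: METHOD_NAMES.get(info.compress_type,
                                            f"method {info.compress_type}")
            for info in zf.infolist()
        }

def make_demo_zip():
    """Build a small in-memory ZIP with one stored and one deflated member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("stored.txt", "hello " * 100, compress_type=zipfile.ZIP_STORED)
        zf.writestr("deflated.txt", "hello " * 100, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()

print(list_methods(make_demo_zip()))
```

Reading the method number works for any method; actually extracting a member only works for the methods the reading tool implements, which is the compatibility problem mentioned above.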

multi-part ZIP


Multipart zip files were once used to get over the maximum size of a medium (e.g. floppy, CD) or maximum supported/allowed file size (e.g. email attachments, FAT32).

(Note that multi-part ZIPs are now often created as Zip64 for large-file reasons(verify))

split zip

They are typically named z01, z02, etc., and the .zip is the last part (not the first as some assume).

These files are mostly just the compressed bytes split at arbitrary points, except that the header offsets are relative to the start of each part.

As such you can't just concatenate them together (though concatenating them properly, or fixing the offsets afterwards, is relatively simple).

Ideally your uncompressor just knows about split multipart zip. For example, on linux it's easier to use 7z/7za than using cat and Info-ZIP's -F / -FF.

spanned zip

Spanned zip is the same as the above in terms of file content, different only in the file naming, specifically all parts have the same name, but reside on different media.

For example, a set of floppies each hold a file of the same name, and these are sequential parts whose order is known only via labeling on the floppies.

Relatively rare, in that it's impractical to keep all these files on one medium (you'd keep them in separate directories, or more probably rename them to be able to handle them).

See also [6]


7zip (when handling non-7z zips)

doesn't create multi-part zip, it simply splits the whole stream at arbitrary points (doesn't alter the offsets),
names them,, etc.
7zip obviously understands what it made itself
For other tools you would directly concatenate these files, which creates a correct single zip file.
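A minimal sketch of why plain concatenation works for this kind of split: the byte stream is cut without touching any offsets, so joining the pieces in order gives back the original file. Everything happens in memory here, and split_bytes, join_parts and make_zip are names made up for the illustration:

```python
import io
import zipfile

def split_bytes(data, part_size):
    """Cut a byte string into consecutive parts of at most part_size bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

def join_parts(parts):
    """Concatenate the parts in order; no offsets need fixing."""
    return b"".join(parts)

def make_zip():
    """Build a small in-memory ZIP to split."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", "some text\n" * 1000)
        zf.writestr("b.txt", "other text\n" * 1000)
    return buf.getvalue()

original = make_zip()
parts = split_bytes(original, 64)
rejoined = join_parts(parts)

with zipfile.ZipFile(io.BytesIO(rejoined)) as zf:
    # testzip() returns None when every member passes its CRC check
    print(len(parts), "parts, first bad member:", zf.testzip())
```

This does not apply to real split/spanned ZIP as described earlier, where the offsets are relative to each part.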



From the WinZip team, based roughly on ZIP but different enough to be a separate format.

Not widely supported.


Compresses better than ZIP, also proprietary but comparably available, so was used instead of ZIP in certain areas.

Uses LZ and PPM(d)


The .7z archive format focuses on LZMA, which performs noticeably better than ZIP's classical DEFLATE, and regularly a little better than other alternatives such as RAR, bz2, gzip, and others.

The 7zip interface uses plugins to support external formats. When you have multiple 7z commands, the difference is that:

  • 7z
    : the variant you can tell to use plugins
    : can read/write .7z, .zip, .gz, .bz2, .tar files, and reads from .rar, .cab, .iso, .arj, .lzh, .chm, .Z, .cpio, .rpm, .deb, and .nsis
  • 7za
    : a standalone variant, which handles only .7z, .zip, .gz, .bz2, .tar, .lzma, .cab, .Z
  • 7zr
    : a lightweight standalone version that handles only .7z
  • 7zg
    : GUI

On the Linux command line, you usually just use 7z (some distributions package only 7za) and can forget about the rest.


Based on DEFLATE, which is a combination of LZ77 and Huffman coding.

Things like WinRAR and 7zip can also open these.

Compression amount/speed tradeoff:

  • has -1 to -9 (and also --fast and --best, respectively 1 and 9)
  • particularly -8 and -9 seem to take a lot more time than the extra compression is worth
  • default is -6, which seems a practical choice
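To get a feel for this tradeoff on your own data, a quick sweep over the levels is enough. A rough sketch using Python's zlib module, on the assumption that its levels 1 to 9 behave comparably to gzip's -1 to -9 (both produce DEFLATE, but they are separate implementations); sweep_levels and the sample data are made up for the illustration:

```python
import time
import zlib

def sweep_levels(data, levels=range(1, 10)):
    """Compress data at each level; return (level, size, seconds) tuples."""
    results = []
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append((level, len(packed), elapsed))
    return results

# Mildly repetitive text-like sample; substitute a file you actually care about.
sample = b"".join(
    f"line {i % 977} value {i * i % 1013}\n".encode() for i in range(50_000)
)

for level, size, seconds in sweep_levels(sample):
    print(f"-{level}: {size} bytes ({100 * size / len(sample):.1f}%) in {seconds:.3f}s")
```

Timings from a single run like this are noisy, so treat them as an indication only.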


Based on BWT and Huffman. Generally gives better compression than deflate (gzip, ZIP).

Things like WinRAR and 7zip can also open these.

Compression amount/speed tradeoff:

  • -9 is the default (verify)
  • compression seems to level off to sub-percent size differences at and above -7
  • Time taken seems to be fairly linear from -1 to -9.
    • if you only care about having compression at all, you could just use -1 (it's perhaps twice the speed of -9)

This seems to mean there's not a sweet spot (as clear as e.g. with gzip) without specifically valuing time over drive space, or the other way around.





LZMA compressor.

In terms of compression per time spent

lzip -0 is comparable to gzip -6
lzip -3 is comparable to bzip2 -9 / xz -3
lzip -9 is comparable to xz -9(verify)

Apparently also seems a little better at error recovery(verify) [7]

lzip and xz are both in the LZMA family. xz is more recent, though there are arguments that it is not the best design for an archiver, whereas lzip just does the one thing it does.

Parallelized variants, and speed versus compression


  • if you care more about speed: pigz -3
  • if you care more about space: lbzip2 -9
  • if you care even more about space, you're probably already an xz user :)


  • for bzip2 there's lbzip2 and pbzip2
  • for gzip there's pigz
  • for lzip there's plzip
  • for xz there's pxz and pixz
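The general idea behind such tools can be sketched as: cut the input into chunks, compress the chunks on separate cores, and write the results out in order. The sketch below does that with independent gzip members, which is valid because a gzip stream may consist of several concatenated members. It is only an illustration of the idea; the real tools differ in detail (chunk sizes, sharing history between chunks, output layout), and parallel_gzip is a made-up name:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def parallel_gzip(data, chunk_size=1 << 20, level=6, workers=4):
    """Compress chunks independently and concatenate the gzip members."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    # Threads are used on the assumption that CPython's zlib releases the GIL
    # while compressing, so the chunks can be compressed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = pool.map(lambda c: gzip.compress(c, compresslevel=level), chunks)
        return b"".join(members)

payload = bytes(range(256)) * 2000           # ~500 KB of sample data
packed = parallel_gzip(payload, chunk_size=64 * 1024)
print(len(payload), "->", len(packed))
```

Because each chunk is compressed without knowledge of the others, smaller chunks mean more parallelism but somewhat worse compression.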

Some details:

  • pigz speed: pigz -3 is ~50% faster than the default -6, at still decent compression
  • pigz -11 is zopfli, a few percent better but much slower
  • lbzip2:
    • speed barely changes with compression level, so you may as well use -9, which is also the default
    • -1 even seems slightly slower than the rest (yes, weird)
    • memory hungrier than others, also at (many-thread) decompression
  • pigz versus lbzip2:
    • lbzip2 compresses better at all settings (one test file: lbzip2: 24%, pigz: 35%)
    • pigz -3 is twice as fast as lbzip2 (verify)
    • pigz -6 is ~20% faster than lbzip2 (verify)
    • pigz -7 (and higher) are slower than lbzip2
    • I've heard that lbzip2 scales better so in some cases is faster; TODO: check
  • pbzip2 is slower than lbzip2 at higher compression, similar speed at lower compression
  • pxz and pixz are slower than the above at higher compression (...what did you expect? :)
    • more comparable at lower compression (have not checked to what degree)

Quick and dirty benchmarks:

(on tmpfs, so negligible IO) on a 218MB, fairly compressible floating point number data file

On an AMD FX6300:

  • pigz -1 compresses it to 80MB (36%) in 1.0sec
  • pigz -2 compresses it to 77MB in 1.1sec
  • pigz -3 compresses it to 76MB in 1.3sec
  • pigz -4 compresses it to 75MB in 1.4sec
  • pigz -5 compresses it to 75MB in 1.8sec
  • pigz -6 compresses it to 75MB in 2.2sec
  • pigz -7 compresses it to 75MB in 2.4sec
  • pigz -8 compresses it to 75MB in 4.2sec
  • pigz -9 compresses it to 75MB (34%) in 11.5sec
  • lbzip2 -1 compresses it to 55MB (25%) in 3.0 sec (yes, slower than -6!)
  • lbzip2 -2 compresses it to 51MB in 2.7 sec
  • lbzip2 -6 compresses it to 48MB in 2.7 sec
  • lbzip2 -9 compresses it to 46MB (21%) in 2.8sec
  • pbzip2 -6 compresses it to 48MB in 3.3sec

On a dual Xeon E5645

  • pigz -1 took 0.58sec
  • pigz -3 took 0.7sec
  • pigz -6 took 1.0sec
  • pigz -8 took 2.0sec
  • pigz -9 took 6.2sec
  • lbzip2 -1 took 1.35sec (seemed less pronounced than above)
  • lbzip2 -2 took 1.3sec
  • lbzip2 -6 took 1.4sec
  • lbzip2 -9 took 1.4sec
  • pbzip2 -6 took 1.7sec

So they all seem to scale roughly linearly (going by passmark score)

Older or less used



Seen in the late nineties, early noughties. Similar to RAR in that it outperforms ZIP, but RAR is more popular.


Seen in the nineties, rarely used now.

Had robust multi-file handling before ZIP, so saw use distributing software, e.g. in BBSes.

There was a successor named JAR, not to be confused with the Java -related JAR archives.


Used on the Amiga and for some software releases.

Now rarely used in the west, but still used in Japan, and there is an LZH compressed folder extension for WinXP (analogous to zipped folder support).

There was a successor named LHx, then LH.

Less practical nerdery


There is a whole set of compression types that are exercises in getting very high compression at the cost of a bunch of time.

They are usually barely command line utilities, let alone easy to use.

These include:

...and many others.




Lossless methods

Entropy coding (minimum redundancy coding)


Mostly per-symbol coding, where symbols are logical parts of the input, regularly fixed-size things, e.g. a character or byte in a file, a pixel of an image, etc.

Probably the most common symbol-style coding, contrasted with variable-run coding, which is often dictionary coding.
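Huffman coding is the classic example of such a per-symbol code: frequent symbols get short bit strings, rare ones get longer ones, and no code is a prefix of another. A minimal sketch, using text bit strings rather than packed bits for readability; the function names are made up for the illustration:

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Return {symbol: bit string} built from the symbol frequencies in data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap items: (weight, tiebreak, {symbol: code so far}).
    # The unique tiebreak keeps tuple comparison from ever reaching the dict.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # the two lightest subtrees...
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))  # ...become one
        counter += 1
    return heap[0][2]

def encode(data, code):
    return "".join(code[s] for s in data)

def decode(bits, code):
    """Works because the code is prefix-free: the first match is the symbol."""
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

text = "abracadabra"
code = huffman_code(text)
bits = encode(text, code)
print(code, len(bits), "bits instead of", 8 * len(text))
```

A real format also has to store or agree on the code table, which this sketch ignores.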

Modified code



Adaptive Huffman (a.k.a. Dynamic Huffman)



Similar to Huffman in concept and complexity.

A little easier to implement, so a nicer programming exercise, yet Huffman typically performs better.


Elias, a.k.a. Shannon-Fano-Elias


Golomb coding

Rice coding

universal codes

In itself mostly a property

Fibonacci coding

Elias gamma coding
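As a concrete instance: Elias gamma codes a positive integer n as its binary representation, preceded by one zero for every binary digit after the first, so the decoder knows how many bits to read. A small sketch using text bit strings; the function names are made up for the illustration:

```python
def elias_gamma_encode(n):
    """Encode a positive integer as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma only codes integers >= 1")
    binary = bin(n)[2:]                      # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary  # e.g. '00' + '101'

def elias_gamma_decode(bits):
    """Decode a concatenation of gamma codes (assumes well-formed input)."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

numbers = [1, 5, 10, 2]
stream = "".join(elias_gamma_encode(n) for n in numbers)
print(stream, elias_gamma_decode(stream))
```

Small numbers get short codes, which is why this kind of code suits data where small values dominate.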

Exp-Golomb (Exponential-Golomb)

Range encoding

Arithmetic coding

Dictionary coding

Dictionary coders work by building up a dictionary of "this short code means this longer data".

Most of the work is in making a dictionary that is efficient for a given set of data.
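One concrete way to build such a dictionary on the fly is the LZ78 style: every time a phrase is seen that is not in the dictionary yet, emit (index of its longest known prefix, next character) and add the phrase. A minimal sketch on strings, with made-up function names, meant to illustrate the idea rather than any particular file format:

```python
def lz78_compress(text):
    """Return a list of (dictionary index, next character) pairs."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending while the phrase is known
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its prefix plus last character.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decompress(pairs):
    """Rebuild the same dictionary from the pairs and concatenate the phrases."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

text = "abababababa"
pairs = lz78_compress(text)
print(pairs)
```

The decompressor never needs to be sent the dictionary, since it can reconstruct it from the pairs in the same order the compressor built it.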


Primarily combines LZ77 and Huffman coding.

Used in ZIP, zlib, gzip, PNG, and others.

See also RFC 1951
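The zlib and gzip formats are thin wrappers around the same DEFLATE stream, which is easy to see with Python's zlib module: its wbits parameter selects a raw stream (-15), a zlib wrapper (15) or a gzip wrapper (31). A small sketch; the deflate helper and the sample data are made up for the illustration:

```python
import gzip
import zlib

def deflate(data, wbits, level=9):
    """Compress with DEFLATE; wbits picks raw (-15), zlib (15) or gzip (31)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()

data = b"the same DEFLATE stream, three wrappers. " * 200

raw = deflate(data, -15)            # bare RFC 1951 stream
zlib_wrapped = deflate(data, 15)    # 2-byte header + 4-byte Adler-32
gzip_wrapped = deflate(data, 31)    # 10-byte header + CRC-32 and length

print(len(raw), len(zlib_wrapped), len(gzip_wrapped))
```

ZIP stores the raw form per member and keeps its own checksums and metadata in its headers.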

Modern compressors may still produce standard deflate streams for compatibility. For example, ZIP's deflate is still the most ubiquitously supported coding in ZIP files, and #zopfli can be used as a drop-in in gzip (and in theory zip).

(ZIP also uses Deflate64 (see its #compression_methods), which is a relatively minor variation)

LZ family and variants

Referring to an approach that Abraham Lempel, and Jacob Ziv used, and various others since.

LZ77 and LZ78 (a.k.a. LZ1 and LZ2)

Refer to the original algorithms described in publications by Lempel and Ziv in 1977 and 1978.


Short for Lempel, Ziv, Welch, refers to an extension of LZ78 described in 1984 by Terry Welch.

Used in

Adoption was limited because implementations were encumbered by patents from 1983 to 2003 (or 2004, depending on where you lived).



Lempel-Ziv-Storer-Szymanski (1982)

Used by PKZip, ARJ, ZOO, LHarc


Lempel-Ziv-Markov chain Algorithm (since 1998)

Mostly like LZ77/deflate, but with smarter dictionary building(verify).

Default method in 7Zip. An optional method in some new versions of the ZIP format.



LZW variant (letters stand for Miller, Wegman)


LZW variant (letters stand for all prefixes)


Syllable-based LZW variant



Built for very fast decompression (and can do it in-place).

Typically used at lower compression settings, to e.g. get ~half the compression ratio of gzip at ~four times the speed.

Decompression is always quite fast.

This is sometimes much of the point: a CPU-cheap way to move a factor more data through the same channels.

You can also make it work harder to get compression ratio and time quite comparable to gzip, while retaining the high-speed, low-resource decompression.


Largely comparable to LZO, but goes for more constrained memory use (verify)


Basically a variant of LZO that further prefers compression speed over compression ratio (by not being as exhaustive as typical DEFLATE style compression).

You can think of it as a "what compression can I get for CPU-cheap?", because it can compress at (order of magnitude) hundreds of MByte/s per modern core.

It is notably used in ZFS, where it is also made to (quickly) detect when no compression should be used at all.


Variant of LZ4

  • same cheap decompression speed
  • typically 20% better compression, but at 5x to 10x slower compression speeds
  • no fast decision to not compress (verify)

...i.e. it's meant to save cost for archiving and backups.

...where explicit compression (e.g. throw it at bzip2 or xz) may be equally sensible. (There seems no hurry to get it into ZFS)


(1995) Used in Amiga archives, Microsoft Cabinets, Microsoft Compressed HTML (.chm), Microsoft Ebook format (.lit), and one of the options in the WIM disk image format.


Lempel-Ziv Ross Williams (1991), which itself has seven variants.


Letters stand for Jeff Bonwick. Derived from LZRW (specifically LZRW1)

Used in ZFS.


Reduced Offset Lempel-Ziv


High compression is always low speed

The highest compression settings tend to mean "search more exhaustively within the given data", and that tends to require significantly more time.

This is in part because compressors tend to work on a shortish window by default, and can be told to look at more data for redundancy. This typically results in more compression, but it's a diminishing-returns deal. It also tends to mean explosively higher RAM requirements for compression and decompression.
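The window effect is easy to provoke on purpose. The sketch below builds data whose only redundancy is a repeat 200 kB back, then compresses it with LZMA2 (via Python's lzma module) using a 64 KiB and a 1 MiB dictionary. The xz_size helper and the specific sizes are choices made for this illustration:

```python
import lzma
import os

def xz_size(data, dict_size):
    """Size of data compressed as .xz with LZMA2 and the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

block = os.urandom(200_000)   # incompressible on its own
data = block * 2              # ...but the second half repeats the first

small_window = xz_size(data, 64 * 1024)     # cannot reach back 200 kB
large_window = xz_size(data, 1024 * 1024)   # sees the whole repeat

print("64 KiB dictionary:", small_window, "bytes")
print("1 MiB dictionary: ", large_window, "bytes")
```

With the small dictionary the repeat is invisible and the output is about as large as the input; with the large one it is roughly halved. The decompressor needs a buffer of the same dictionary size, which is where the RAM cost comes from.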

So after a while, it's just not worth it for general use.

There is still an argument for cases where you will read/send/serve the result many times - it will eventually make up for the initial investment (or when storage is more expensive to your bottom line than CPU power, which should be basically never). This is why things like png crushers exist, the reason behind zopfli, and why you may wish to sometimes use xz or lzip for software distribution.

Generic tools may not even let you try this, e.g. bzip2 settings barely matter, and for good reason.

High speed, reasonable compression


(see above)

These can make sense in systems

  • where CPU is generally not the bottleneck, so spending a little CPU on light data compression is almost free
  • and/or that tend to write as much as they read (e.g. storage/database systems)

This e.g. makes sense

  • for database and storage systems, since these are often relatively dedicated, and tend to have a core or two to spare
  • and/or when it lowers network transmission (bandwidth and interrupt load)

  • LZO (see above)
  • LZF (see above)
  • LZ4, e.g. used in ZFS's transparent data compression
(e.g. compresses similarly to e.g. FastLZ, QuickLZ in less time)
  • snappy (previously zippy) seems similar to LZ4
  • zstd and gipfeli seem to intentionally aim at more compression than LZ4/snappy, at speeds still higher than typical DEFLATE. (verify)



Floating point compression

Floating point data, particularly when higher-dimensional, has a tendency to be large.

And also to have smooth, easily modelled patterns.

Very clean floating point data may compress well using completely general-purpose compression anyway.

...yet there is more to be won for various everyday data.
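One simple, commonly used trick (not something these notes describe, so take it as an illustrative assumption) is to regroup the bytes before handing them to a general-purpose compressor: put the first byte of every float together, then the second, and so on. For values in a narrow range the sign/exponent bytes are nearly constant and then sit in long runs. A sketch with float32 values and zlib; the function names and the synthetic data are made up:

```python
import random
import struct
import zlib

def byte_shuffle(raw, width=4):
    """Group byte 0 of every value, then byte 1, and so on."""
    return b"".join(raw[i::width] for i in range(width))

def byte_unshuffle(shuffled, width=4):
    """Inverse of byte_shuffle (length must be a multiple of width)."""
    n = len(shuffled) // width
    out = bytearray(len(shuffled))
    for i in range(width):
        out[i::width] = shuffled[i * n:(i + 1) * n]
    return bytes(out)

# Synthetic "measurement" data: values near 20.0 with a little noise.
rng = random.Random(0)
values = [20.0 + rng.gauss(0, 0.01) for _ in range(50_000)]
raw = struct.pack(f"<{len(values)}f", *values)

plain = zlib.compress(raw, 6)
shuffled = zlib.compress(byte_shuffle(raw), 6)
print("plain:", len(plain), "shuffled:", len(shuffled), "of", len(raw))
```

The noisy low mantissa bytes stay incompressible either way; the gain comes entirely from the predictable high bytes.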


zopfli, brotli


These are meant for the web.

zopfli produces a standard DEFLATE stream (often then wrapped in zlib or gzip) that tends to squeeze out a little more compression than zlib/gzip at their max settings, just by exhausting more alternatives - which also makes it much slower.

It is intended for things you compress once, then serve many times, e.g. on the web for woff, PNG and such. Certainly not for on-the-fly compression or one-time transfers.

Brotli is built for fast on-the-fly text compression, and tuned with dictionaries that do better at some of the more predictable text on the web. It tends to shave 20% off HTML, JS, and CSS.

This is comparable to zlib/gzip/DEFLATE on its lighter settings, except that brotli is meant to decompress faster and to have well-bounded resource requirements (not unique features, but can be a useful combination for some purposes).(verify)



Burrows-Wheeler transform (BWT)

See Searching_algorithms#Burrows-Wheeler_Transform

Being reversible and putting similar patterns near each other, using the BWT as a step in compression algorithms makes sense, and is indeed used: bzip2 is built around it.
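A naive sketch of the transform and its inverse, to show both properties. It sorts all rotations explicitly, which is fine for a demonstration but far too slow and memory-hungry for real use (real implementations use suffix sorting). The "$" end marker is an assumption of this sketch and must not occur in the input:

```python
def bwt(text, sentinel="$"):
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the rotation table column by column, then pick the original row."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepending the last column and re-sorting restores one more column.
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

transformed = bwt("banana")
print(transformed)                 # similar characters end up next to each other
print(inverse_bwt(transformed))
```

The output has the same length as the input, so the BWT itself compresses nothing; it only rearranges the data so that later stages have an easier job.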

Prediction by Partial Matching (PPM)

PPM is the concept, PPM(n) is PPMd

Dynamic Markov Compression (DMC)

Context tree weighting (CTW)

Context Mixing (CM)

Lossy methods

Note that lossy (sometimes 'lossful') methods often employ lossless coding of some perceptual approximation.

Transparency is the quality of compressed data being functionally indistinguishable from the original - a lack of (noticeable) artifacts. Since this is a perceptual quality, this isn't always easy to quantify.

Ideally, lossy compression balances transparency and low size/bitrate, and some methods will search for the compression level at which a good balance happens.






Pack200 is a compression scheme for Java (specified in JSR 200), which seems to specialize in informed/logical compression of bytecode, (and bytecode within JARs)

Pack200 plus gzip compress these things better than gzip alone.

It's used with JWS, among other things
