Compression notes
📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.
Stuff vaguely related to storing, hosting, and transferring files and media:
Utilities and file formats (rather than methods and algorithms)
General purpose
ZIP
Commonly used, mostly because it caught on so well in the mid-nineties it sort of became a generic brand name for compressed files.
Created by PKWare as PKZIP, though many people used the WinZip product, which is not from PKWare.
There have been a bunch of revisions/additions (see version history), and the number of underlying compression formats each item/file can use has increased over time.
Of the compression formats,
some are outdated,
many are specialized.
To avoid compatibility problems, ZIP files meant to be widely readable stick to DEFLATE. Those particular parts have basically not changed in twenty years.
There is optional encryption (with a compatibility kerfuffle in the early years). (verify)
compression methods
General purpose, little point because basic deflate is usually better:
- 0: Store (no compression)
- 1: Shrink
- 2: Reduce (compression factor 1)
- 3: Reduce (compression factor 2)
- 4: Reduce (compression factor 3)
- 5: Reduce (compression factor 4)
- 6: Implode
General purpose, more common:
- 8: Deflate (the one that everything supports)
- 9: Deflate64 ('Enhanced deflate')
General purpose, fancier:
- 14: LZMA
- 98: PPMd (since WinZip version 11(verify))
- 12: bzip2 (since WinZip version 11(verify))
- 95: XZ
Media-specific:
- 94: MP3
- 96: Compressed lossless JPEG
- 97: WavPack
Other:
- 10: Old IBM TERSE
- 18: New IBM TERSE
- 19: IBM LZ77 z
- 93 (previously 20): zstd
- 7: "Reserved for Tokenizing compression algorithm" (?)
- 99: (AES?) encryption hides the compression method
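For reference, a quick way to check which of the above methods a given ZIP file actually uses, sketched with Python's zipfile module (which can list entries for any method, though it only extracts store/deflate/bzip2/lzma itself). The file name and the name mapping are illustrative, not exhaustive:
 import zipfile
 # Names for a few of the method IDs listed above (not exhaustive);
 # "example.zip" is a placeholder file name.
 method_names = {0: "store", 8: "deflate", 9: "deflate64", 12: "bzip2",
                 14: "lzma", 93: "zstd", 95: "xz", 98: "ppmd"}
 with zipfile.ZipFile("example.zip") as zf:
     for info in zf.infolist():
         method = method_names.get(info.compress_type, str(info.compress_type))
         print(f"{info.filename}: {method}, {info.file_size} -> {info.compress_size} bytes")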
multi-part ZIP
Multipart zip files were once used to get around:
- the maximum size of a medium, in particular floppies but also CDs
- the maximum supported/allowed file size (e.g. email attachments, FAT32)
(Note that multi-part ZIPs are now often created as Zip64 for large-file reasons(verify))
split zip
These are typically named z01, z02, etc., and the .zip is the last part (not the first, as would also be quite sensible to assume).
These files are mostly just the compressed bytes split at arbitrary points, except that the offsets in each part's headers are relative to the start of that part(verify).
Which means that just concatenating them together does not make a correct zip file. Code that fixes the offsets while concatenating (or even fixes them after naive concatenation) would be relatively simple, and there are tools to do both.
Ideally your uncompressor just knows about split multipart zip, though, because it's fewer steps for everyone involved.
spanned zip
Spanned zip is the same as the above in terms of file contents, different only in the file naming, specifically all parts have the same name, but reside on different media.
For example, a set of floppies all have game.zip on them, which happen to be sequential parts known only via physical labeling on the floppies.
Relatively rare, in that it's impractical to keep all these files on one medium (they share a name), and it implies some extra actions during decompression.
See also [1]
Other
7zip (when handling non-7z zips)
- doesn't create multi-part zip, it simply splits the whole stream at arbitrary points (doesn't alter the offsets),
- names them name.zip.001, name.zip.002, etc.
- extraction:
- 7zip obviously understands what it made itself
- For other tools you would directly concatenate these files, which creates a correct single zip file (a sketch follows below).
- https://superuser.com/questions/602735/how-could-i-portably-split-large-backup-files-over-multiple-discs/602736#602736
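A minimal sketch of that concatenation in Python ("backup.zip" is a placeholder name; the zero-padded suffixes mean a plain sort gives the right order):
 import glob, shutil
 # Join 7zip-style split parts (name.zip.001, name.zip.002, ...) back into one file.
 parts = sorted(glob.glob("backup.zip.[0-9]*"))
 with open("backup.zip", "wb") as out:
     for part in parts:
         with open(part, "rb") as f:
             shutil.copyfileobj(f, out)
This is equivalent to cat backup.zip.* > backup.zip in a unix shell.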
ZIP file magic
ZIP-based packages or documents
More format notes
Appending stuff
File formats that are ZIP based
ZIPX
From the WinZip team; seems to just be ZIP, but given the choice to use the more recently introduced compression formats (bzip2, LZMA, PPMd, JPEG and WavPack) more aggressively over DEFLATE, they consider it a separate format(verify).
Not widely supported by other ZIP tools.
RAR
Compresses better than classic ZIP; also proprietary but comparably available, so for a decade or two it was frequently used instead of ZIP in certain areas.
Uses LZ and PPM(d).
7z
The .7z archive format focuses on LZMA, which performs noticeably better than ZIP's classical DEFLATE, and regularly a little better than other alternatives such as RAR, bz2, gzip, and others.
The 7zip program is useful as an interface that handles all those formats, but the 7zip format itself is not supported by many others.
The 7zip interface uses plugins to support external formats.
When you use it from the CLI and have multiple 7z commands, the difference is that:
- 7z: the variant you can tell to use plugins
- can read/write .7z, .zip, .gz, .bz2, .tar files, and reads from .rar, .cab, .iso, .arj, .lzh, .chm, .Z, .cpio, .rpm, .deb, and .nsis
- 7za: a standalone variant, which handles only .7z, .zip, .gz, .bz2, .tar, .lzma, .cab, .Z
- 7zr: a lightweight standalone version that handles only .7z
- 7zg: GUI
In general (e.g. the linux CLI) you can use 7z (some distributions package only 7za) and forget about the rest.
gzip
Based on DEFLATE, which is a combination of LZ77 and Huffman coding.
Fairly widely supported (things like WinRAR and 7zip can also open these).
Compression amount/speed tradeoff:
- has -1 to -9 (and also --fast and --best, respectively 1 and 9)
- particularly -8 and -9 seem to take a lot more time than is worth it compression-wise.
- default is -6, which seems a practical choice
bzip2
Based on BWT and Huffman.
Generally gives better compression than DEFLATE (e.g. gzip, classic/compatible ZIP), and is a little more resource-intensive.
Like gzip, this is generic enough that various other tools (e.g. 7zip) can also open these.
Compression amount/speed tradeoff:
- -9 is the default (verify)
- compression seems to level off to sub-percent size differences at and above -7
- Time taken seems to be fairly linear from -1 to -9.
- ...so if you care about speed more than space, you could use -1 (it's perhaps twice the speed of -9)
This seems to mean there's no sweet spot as clear as gzip's, without specifically valuing time over drive space or the other way around. (A quick way to test this on your own data is sketched below.)
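A minimal sketch for checking that level/speed tradeoff on your own data, using Python's stdlib gzip and bz2 ("myfile.dat" is a placeholder):
 import bz2, gzip, time
 data = open("myfile.dat", "rb").read()   # placeholder: use something representative
 for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
     for level in (1, 6, 9):
         start = time.perf_counter()
         size = len(compress(data, compresslevel=level))
         print(f"{name} -{level}: {size / len(data):.1%} of original, "
               f"{time.perf_counter() - start:.2f}s")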
pax
https://en.wikipedia.org/wiki/Pax_(Unix)
xar
Used around OSX,
apparently replaced gzipped pax
https://en.wikipedia.org/wiki/Xar_(archiver)
XIP
.xar which can be digitally signed.
Used around OSX
https://en.wikipedia.org/wiki/.XIP
ARC
ARJ
Seen in the nineties, rarely used now.
Had useful multi-file handling before ZIP really did, so saw use distributing software, e.g. in BBSes.
There was a successor named JAR, not to be confused with the Java-related JAR archives.
ACE
Seen in the late nineties, early noughties. Similar to RAR in that it outperforms ZIP, but RAR is more popular.
LHA/LZH (LHarc)
Used on the Amiga and for some software releases.
Now rarely used in the west, but still used in Japan; there is an LZH compressed folder extension for WinXP, analogous to the zipped folder support in western Windows versions.
There was a successor named LHx, then LH.
xz
LZMA algorithm, often with the file extension ".xz"
lzip
LZMA compressor.
In terms of compression per time spent
- lzip -0 is comparable to gzip -6
- lzip -3 is comparable to bzip2 -9 / xz -3
- lzip -9 is comparable to xz -9(verify)
Apparently also a little better at error recovery(verify) [2]
lzip and xz are both LZMA family;
xz is more recent, though there are arguments why it's not the best design as an archiver,
whereas lzip just does the one thing it does.
http://www.nongnu.org/lzip/lzip.html
Speed versus compression, including parallelized variants
tl;dr:
- Parallelized variants include:
- for bzip2 there's lbzip2 and pbzip2
- for gzip there's pigz
- for lzip there's plzip
- for xz there's xz itself, and also pxz and pixz
- if you care more about speed: pigz -3
- if you care more about space: lbzip2 -9 (-9 is the default, so is optional to specify)
- if you care even more about space, xz without parallelization seems best, but is rather slower
Some notes:
- pigz speed: pigz -3 is ~50% faster than the default -6, at still decent compression
- pigz -11 is zopfli, a few percent better but much slower
- lbzip2:
- speed barely changes with compression level, so you may as well use -9, also the default (-1 seems slightly slower than the rest, even. Yeah, weird)
- memory hungrier than others, also at (many-thread) decompression
- pigz versus lbzip2 - draw your own conclusions from below, and do tests on your own typical data(!) (a small harness follows these notes), but it seems:
- you can expect bzip2 to compress better than gzip at all settings, parallel or not (one test file: lbzip2:24% pigz:35%)
- pigz -3 is noticeably faster than lbzip2 (verify)
- pigz -7 (and higher) are slower than lbzip2 (so consider whether it's worth it at all)
- I've heard that lbzip2 scales better with cores so on beefy dedicated hardware things may be a little different
- pbzip2 is slower than lbzip2 at higher compression, similar speed at lower compression
- pxz and pixz are slower than the above at higher compression (...what did you expect?)
- (more comparable at lower compression - have not checked to what degree)
- it seems that parallel xz doesn't compress as well as non-parallelized xz
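The harness mentioned above, as a rough sketch: time the parallel CLI tools on a file of your own ("big.dat" is a placeholder; the tools must be installed and on PATH, and -c/--stdout keeps them from touching the input file):
 import subprocess, time
 for cmd in (["pigz", "-3", "-c"], ["pigz", "-9", "-c"],
             ["lbzip2", "-9", "-c"], ["xz", "-5", "-T0", "-c"]):
     start = time.perf_counter()
     result = subprocess.run(cmd + ["big.dat"], stdout=subprocess.PIPE, check=True)
     print(f"{' '.join(cmd)}: {len(result.stdout)} bytes in {time.perf_counter() - start:.1f}s")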
Quick and dirty benchmarks:
(on tmpfs, so negligible IO) on a 218MB, fairly compressible floating point number data file
On an AMD FX6300:
- pigz -1 compresses it to 80MB (36%) in 1.0sec
- pigz -2 compresses it to 77MB in 1.1sec
- pigz -3 compresses it to 76MB in 1.3sec
- pigz -4 compresses it to 75MB in 1.4sec
- pigz -5 compresses it to 75MB in 1.8sec
- pigz -6 compresses it to 75MB in 2.2sec
- pigz -7 compresses it to 75MB in 2.4sec
- pigz -8 compresses it to 75MB in 4.2sec
- pigz -9 compresses it to 75MB (34%) in 11.5sec
- lbzip2 -1 compresses it to 55MB (25%) in 3.0 sec (yes, slower than -6!)
- lbzip2 -2 compresses it to 51MB in 2.7 sec
- lbzip2 -6 compresses it to 48MB in 2.7 sec
- lbzip2 -9 compresses it to 46MB (21%) in 2.8sec
- pbzip2 -6 compresses it to 48MB in 3.3sec
Similar tests on a dual Xeon E5645 were approximately twice as fast,
so it seems to scale roughly linearly with cores (going by passmark score).
---
comparing xz with others in terms of compression and time
To give a case where xz is a reasonable choice, the following was tested on 200MByte of quite-compressible XML.
- xz -0: 15MByte in 3.3sec
- xz -1: 13MByte in 3.5sec
- xz -2: 8.7MByte in 4.3sec
- xz -3: 8.2MByte in 6.8sec
- xz -4: 9.4MByte in 14sec
- xz -5: 2.6MByte in 20sec
- xz -6: 2.4MByte in 30sec (default)
- xz -7: 2.4MByte in 30sec
- xz -8: 2.2MByte in 30sec
- xz -9: 2.2MByte in 32sec
For reference on the same file:
- gzip -6: 18MByte in 3sec
- gzip -3: 23MByte in 1.6sec
- bzip2 -9: 11MByte in 22sec
Notes:
- while xz 7..9 can be better, it also takes the decompressor a bunch more RAM (the man page summarizes this) and particularly on smaller files there tends to be little to no improvement
- note that -4..-6 moves into territory comparable to gzip and bzip2 (compressing a good deal more than gzip, while being faster than bzip2)
- for this quite-compressible data, xz at lower compression settings moves into gzip and bzip2's area (depending a little on what you value more)
- so you can consider xz as giving you a wider range of tradeoffs
xz and parallelism
It seems xz's parallelism noticeably reduces compression (presumably due to how it splits up the data?) - compare: (same data as last)
- xz -5: 2.6MByte in 20sec
- xz -5 -T 6: 4.3MByte in 5.5sec
- pixz -5: 4.9MByte in 5.5sec
To see if that stays true on larger data, a test with the same on 2Gbyte of the same sort of data:
- xz -5: 25MByte in 205sec
- xz -5 -T0: 38MByte in 30sec
- xz -7: 23MByte in 306sec
And for 'is this worth it' comparison:
- gzip -3: 250MByte in 17sec
TODO: more comparison
Decompression speed
Of the three, gzip decompresses faster than xz, and both gzip and xz decompress faster than bzip2.
Less practical nerdery
There is a whole set of compression types that are exercises in getting very high compression at the cost of a bunch of time.
They are usually barely command line utilities, let alone easy to use.
These include:
- PAQ family, see e.g. http://en.wikipedia.org/wiki/PAQ#History
- BBB ('Big Block BWT')
- SR2 (symbol ranking)
- FLZP (bytewise LZP)
- SLIM
- Ocamyd (Dynamic Markov)
- PIMPLE2
...and many others.
Other
Lossless methods
Entropy coding (minimum redundancy coding)
http://en.wikipedia.org/wiki/Entropy_coding
Mostly per-symbol coding - where symbols are logical parts of the input, often fixed-size things, e.g. a character or byte in a file, a pixel of an image, etc.
Probably the most common symbol-style coding, contrasted with variable-run coding, which is often dictionary coding.
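As a worked example of per-symbol coding, a minimal Huffman sketch (see the Huffman section below): build the tree over byte frequencies and return a code length per byte value, which is enough to judge the compressed size (actual bit patterns would follow from the lengths, e.g. as a canonical code):
 import heapq
 from collections import Counter
 def huffman_code_lengths(data: bytes) -> dict[int, int]:
     freq = Counter(data)
     if not freq:
         return {}
     if len(freq) == 1:                      # degenerate case: one distinct byte value
         return {next(iter(freq)): 1}
     # heap items: (weight, tiebreaker, {byte value: depth so far})
     heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
     heapq.heapify(heap)
     tiebreak = len(heap)
     while len(heap) > 1:
         w1, _, a = heapq.heappop(heap)      # merge the two lightest subtrees...
         w2, _, b = heapq.heappop(heap)
         merged = {s: d + 1 for s, d in a.items()}
         merged.update({s: d + 1 for s, d in b.items()})
         heapq.heappush(heap, (w1 + w2, tiebreak, merged))  # ...into one, a level deeper
         tiebreak += 1
     return heap[0][2]
 lengths = huffman_code_lengths(b"abracadabra")
 # 'a', the most frequent byte, ends up with the shortest code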
Modified code
Huffman
See also:
Adaptive Huffman (a.k.a. Dynamic Huffman)
See also:
Shannon-Fano
Similar to Huffman in concept and complexity.
A little easier to implement, so a nicer programming exercise, yet Huffman typically performs better.
See also:
Elias, a.k.a. Shannon-Fano-Elias
See also:
Golomb coding
http://en.wikipedia.org/wiki/Golomb_coding
Rice coding
https://en.wikipedia.org/wiki/Golomb_coding#Rice_coding
universal codes
In itself mostly a property (that a code can have). https://en.wikipedia.org/wiki/Universal_code_(data_compression)
Fibonacci coding
http://en.wikipedia.org/wiki/Fibonacci_coding
Elias gamma coding
https://en.wikipedia.org/wiki/Elias_gamma_coding
Exp-Golomb (Exponential-Golomb)
http://en.wikipedia.org/wiki/Exponential-Golomb_coding
Range encoding
http://en.wikipedia.org/wiki/Range_encoding
Arithmetic coding
https://en.wikipedia.org/wiki/Arithmetic_coding
Dictionary coding
Dictionary coders work by building up a dictionary of "this short code means this longer data".
The method of building that dictionary varies, and most of the work is in making a dictionary that is efficient for a given set of data.
Deflate/flate
Primarily combines LZ77 and Huffman coding.
Used in ZIP, zlib, gzip, PNG, and others.
See also RFC 1951
Modern compressors may still produce standard deflate streams for compatibility. For example, ZIP's deflate is still the most ubiquitously supported coding in ZIP files, and #zopfli can be used as a drop-in for gzip (and in theory zip).
(ZIP also uses Deflate64 (see its #compression_methods), which is a relatively minor variation)
LZ family and variants
Referring to an approach that Abraham Lempel and Jacob Ziv used, and that various others have built on since.
LZ77 and LZ78 (a.k.a. LZ1 and LZ2)
Refer to the original algorithms described in publications by Lempel and Ziv in 1977 and 1978.
http://en.wikipedia.org/wiki/LZ77_and_LZ78
LZW
Short for Lempel, Ziv, Welch, refers to an extension of LZ78 described in 1984 by Terry Welch.
Used in various places, including GIF, TIFF, Unix compress, and others.
Adoption was limited because implementations were encumbered by patents from 1983 to 2003 (or 2004, depending on where you lived).
See also:
LZSS
Lempel-Ziv-Storer-Szymanski (1984)
Used by PKZip, ARJ, ZOO, LHarc
http://en.wikipedia.org/wiki/Lempel-Ziv-Storer-Szymanski
LZMA
Lempel-Ziv-Markov chain Algorithm (since 1998)
Still much like LZ77/deflate, but with smarter dictionary building(verify).
Default method in 7Zip. An optional method in some new versions of the ZIP format.
http://en.wikipedia.org/wiki/LZMA
LZMW
LZW variant (letters stand for Miller, Wegman)
http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch#Variants
LZAP
LZW variant (letters stand for all prefixes)
http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch#Variants
LZWL
Syllable-based LZW variant
http://en.wikipedia.org/wiki/LZWL
LZO
Lempel-Ziv-Oberhumer
Built for very fast decompression (and can do it in-place(verify)).
Compression speed is more of a tradeoff, though it's typically used at lower compression settings, to e.g. get ~half the compression ratio of gzip at ~four times the speed, because usually the point is a CPU-cheap way to store more data / move it faster.
You can also make it work harder, to get compression ratio and time quite comparable to gzip -- while retaining the high-speed, low-resource decompression.
https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Oberhumer
LZF
Largely comparable to LZO, but designed for more constrained memory use (verify)
LZ4
Basically a variant of LZO.
Fast to decompress.
Compression too prefers speed over compression ratio (by not being as exhaustive as typical DEFLATE style compression).
You can think of it as a "what compression can I get at minimal CPU cost?",
in that it can compress at hundreds of MByte/s per modern core (order of magnitude).
It is an option in things like ZFS, a filesystem biased towards archiving,
where a little CPU and memory overhead on writes is considered worth it.
ZFS also tries to (quickly, early in the block(verify)) detect when no compression should be used at all, which adds to the 'minimal CPU cost' aspect.
Comparison between LZO and LZ4 seems to be:
- LZ4 is done faster
- LZO can compress a little better
- LZO uses less memory
https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
LZ4HC
Variant of LZ4
- similarly cheap decompression speed
- typically 20% better compression, but at 5x to 10x slower compression speeds
- no fast decision to not compress (verify), so e.g. in ZFS you generally don't want it over LZ4
...so it's meant to save storage cost for e.g.
- archiving and backups on hosts that otherwise don't use their CPU much.
- write-once-read-many cases,
...where explicit compression (e.g. throw it at bzip2 or xz) may be equally sensible. (there seems no hurry to get it into ZFS)
LZX
(1995) Used in Amiga archives, Microsoft Cabinets, Microsoft Compressed HTML (.chm), Microsoft Ebook format (.lit), and one of the options in the WIM disk image format.
http://en.wikipedia.org/wiki/LZX
LZRW
Lempel-Ziv Ross Williams (1991), which itself has seven variants.
http://en.wikipedia.org/wiki/LZRW
LZJB
Letters stand for Jeff Bonwick. Derived from LZRW (specifically LZRW1)
Used in ZFS.
http://en.wikipedia.org/wiki/LZJB
ROLZ
Reduced Offset Lempel-Ziv
http://en.wikipedia.org/wiki/Reduced_Offset_Lempel_Ziv
zstd
zstd (Zstandard, extension may be .zst) is an LZ77-family compressor, aiming to be a fast compressor and in particular a fast decompressor, at zlib-like compression ratios.
Very roughly:
- at one end it has lzo / lz4 speeds, but compresses more
- at the other end it has zlib/bzip2 compression (but faster than those)
Part of that speed seems to come from multicore being designed in, rather than being an afterthought.
There are other footnotes, e.g. around compression and decompression memory use.
zstd makes sense within certain systems, because of the lower CPU use (or, as a tradeoff, higher speed) at almost the same compression as zlib.
See also:
Combinations
zopfli, brotli
These are meant for the web.
Zopfli produces a standard DEFLATE stream (often then wrapped in zlib/gzip)
but is an implementation that tends to squeeze out a little more compression than zlib/gzip at their max settings,
by exhausting more alternatives.
This makes it much slower, and not useful for on-the-fly compression or one-time transfers.
It is intended for things you compress once, then serve many many times, e.g. on the web for woff, PNG and such, while not changing the DEFLATE compression standard at all.
Brotli is built for fast on-the-fly text compression, and has dictionaries primed with some very common strings on the web. It tends to shave 20% more off for HTML, JS, and CSS.
That compression ratio is comparable to zlib/gzip/DEFLATE on its lighter settings, yet brotli is meant to decompress faster, and to have low and well-bounded resource requirements (not unique features, but a useful combination for some purposes).(verify)
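The "primed dictionary" idea can be illustrated with zlib's preset-dictionary feature: compressor and decompressor agree on a dictionary up front, so even a short message can reference it. The preset and message below are made up for illustration:
 import zlib
 preset = b'<html><head><meta charset="utf-8"><title></title></head><body></body></html>'
 message = b'<html><head><meta charset="utf-8"><title>hi</title></head><body>hello</body></html>'
 plain = zlib.compress(message, 9)
 co = zlib.compressobj(9, zdict=preset)
 primed = co.compress(message) + co.flush()
 do = zlib.decompressobj(zdict=preset)         # the decoder needs the same preset
 assert do.decompress(primed) == message
 print(len(message), len(plain), len(primed))  # the primed stream should be clearly smaller here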
See also:
Techniques
Burrows-Wheeler transform (BWT)
See Searching_algorithms#Burrows-Wheeler_Transform
Not compression in itself, just a way to reorder a string - but one that tends to put similar patterns near each other, and one that is fully reversible.
This makes the BWT useful as a step in compression algorithms, and it is indeed used that way: bzip2 is built around it.
Prediction by Partial Matching (PPM)
PPM is the general concept; PPMd is the variant you usually see in practice (e.g. as ZIP method 98, and in RAR).
Dynamic Markov Compression (DMC)
Context tree weighting (CTW)
Context Mixing (CM)
Choices
High compression, who-cares-about-speed
Consider file-serving purposes where you compress once, then serve millions of times.
You pay mostly for transfer, so it pays to try your hardest once to compress it.
This is a big argument behind
- xz for archiving (more specialized) and for things like package repositories, when you pay for both storage and network transfer
- pre-compressed zlib through HTTP (not as efficient, but standard)
- and specifically zopfli for your most-served stuff.
This means high compression ratios save the most cost, and you want to look at bzip2, xz, 7z, LZMA and PPMd methods, ZPAQ.
While these tend to not parallelize as brilliantly as some specialized variants, if you have cores to spare you may get more compression within the same timespan, which can be a no-brainer. See e.g. #Speed_versus_compression,_including_parallelized_variants for the kind of tradeoffs -- but test on your own data.
On CPU and memory difference under higher compression
The higher compression settings tend to mean "search more exhaustively within the given data to see if there is a slightly better solution".
This means the highest compression almost always means significantly lower compression speed - but often little to no difference in decompression speed.
But, under methods that involve lookup tables, it can make a real difference to the memory used in decompression and compression.
And it comes with quickly diminishing returns in compression,
returns that are soon not worth the extra CPU time and/or memory use.
Or responsiveness. Things like LZ4 have no parameters, which is good when you want guarantees like 'is always done in what seems like negligible time'.
It does mean lower compression, because they only really focus on redundancy within a shortish window, rather than the best possible solution for a large amount of data.
This is also part of why there is often more difference between compression methods, than there is within a method at different compression levels.
There is still an argument for cases where you will read/send/serve the result many times - it will eventually make up for the initial investment (or when storage is more expensive to your bottom line than CPU power, which should be basically never). This is why things like png crushers exist, the reason behind zopfli, and why you may sometimes wish to use xz or lzip for software distribution.
Generic tools may not even let you try this, e.g. bzip2 settings barely matter, and for good reason.
High speed, reasonable compression
(see above)
These can make sense in systems
- where CPU is generally not the bottleneck, so spending a little CPU on light data compression is almost free
- and/or tend to write as much as they read (e.g. storage/database systems)
For example
- for database and storage systems, since these are often relatively dedicated, and tend to have a core or two to spare.
- and/or when it lowers network transmission (bandwidth and interrupt load)
- LZO (see above)
- LZF (see above)
- LZ4
- snappy (previously zippy) seems similar to LZ4
- zstd and gipfeli seem to intentionally aim at more compression than LZ4/snappy, at speeds still higher than typical DEFLATE. (verify)
Note that some of these have the footnote "If you've got a core or two to spare."
Which on dedicated storage servers is typically true, while on workhorses might take a few percent off your general performance.
See also:
Lossy methods
Note that lossy (sometimes 'lossful') methods often employ lossless coding of some perceptual approximation.
Transparency is the property of compressed data being functionally indistinguishable from the original - a lack of (noticeable) artifacts. Since this is a perceptual property/quality, this isn't always easy to quantify, and for varied contexts (images, music, etc.) there may be multiple estimating metrics.
Ideally, lossy compression balances transparency and low size/bitrate, and some methods will search for the compression level at which a good balance happens.
Floating point compression
Floating point data, particularly when higher-dimensional, has a tendency to be large.
And also to have smooth, easily modelled patterns.
Very clean floating point data may compress well, even using completely general-purpose compression
that doesn't know it's floating point data.
Noisier data less so, but methods that know it's dealing with floating point may do better.
Note that floating point compression can be essentially lossless.
There is often also the option for lossy compression. Conceptually, you can shave bits off the mantissa. The tradeoff may be worth it, particularly when you know the data was noisy to start with.
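A minimal sketch of that mantissa-shaving idea for float32 data, assuming numpy (the choice of 12 dropped bits is arbitrary, not a recommendation): zeroing low mantissa bits makes the byte stream more repetitive, so a general-purpose compressor does better on it.
 import zlib
 import numpy as np
 def truncate_mantissa(a, drop_bits=12):
     # Zero the lowest mantissa bits of float32 values (lossy, but with bounded error).
     mask = np.uint32((0xFFFFFFFF << drop_bits) & 0xFFFFFFFF)
     bits = a.astype(np.float32).view(np.uint32) & mask
     return bits.view(np.float32)
 data = np.random.default_rng(0).normal(size=100_000).astype(np.float32)
 for label, arr in (("original", data), ("truncated", truncate_mantissa(data))):
     ratio = len(zlib.compress(arr.tobytes(), 6)) / arr.nbytes
     print(f"{label}: {ratio:.0%} of raw size after zlib level 6")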
Perceptual coding
Perceptual coding is matching your compression to the auditory or visual system, allowing you to minimize perceived distortion at a given compression ratio.
Most audio, image, and video compression uses some variant of this, because what we call "better compression" usually means "looks better at a size I consider reasonable".
Audio
Image
Video
Digging into some details
LZW notes
For related compression methods, see Compression notes
General
Properties:
- relatively easy to implement.
- If you're the learn-by-doing kind, you may like this one - you'll make a halfway decent compression algorithm and understand it decently
- Lots of little edge cases before it's production-ready, though.
- on compression level:
- for readily compressible things, LZW is pretty decent - repetitive text like logs may become ~10% of original, general text perhaps 40%
- in general it's not as good as other common and still relatively simple algorithms (like zlib)
- incompressible data in LZW coded form may become approx. 110%..130% the original size (varying with the encoder's response to dictionary growth)
- memory overhead is bounded, and small to moderate (...for a codebook-style compression algorithm)
- Beyond the input/output (which can be streamed), there's only the dictionary
- the number of entries in the dictionary is bounded by agreement on the maximum code width. For the somewhat standard 12-bit maximum code width it's fairly small, and even for, say, 20-bit dictionaries it's often manageable.
- the maximum size of the entries is effectively bounded by the way it is built up. It's not particularly minimized, though there are more-efficient-than-naive implementations
Concepts, terms, and some implementation/behaviour notes
- Glossary
bit width - the number of bits currently needed to be able to use all indexes in the table, which is ceil(log2(dictionary_size+1))
code width refers to the number of bits needed to code entries from the dictionary, which depends (only) on the (current) number of entries.
- In practice the maximum size and code width are often decided before you start.
code size can be a bit ambiguous, often referring to the code width, or sometimes the dictionary size.
Dictionary (a.k.a. codebook, code table, translation table) refers to a lookup table between uncompressed strings of data and the symbols (a.k.a. codepoints) they are assigned.
Dictionary size refers to the number of entries in the dictionary.
Index usually refers to a position within the dictionary (referring to the string of data that position of the dictionary stores). The compressed data consists of a list of such indices.
Symbol usually refers to the indices. Below, symbols and indices are synonymous. (In a formal-language-theory context it could also refer to the elements of the uncompressed data. Sometimes 'elements' is used to refer less ambiguously to units of the uncompressed input.)
Encoding input data to symbols/indices
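A minimal sketch of the encoding side (basic LZW only: byte-valued input, no maximum code width, no clear codes), to make the loop concrete:
 def lzw_encode(data: bytes) -> list[int]:
     dictionary = {bytes([i]): i for i in range(256)}   # start with all single bytes
     next_code = 256
     current = b""
     out = []
     for byte in data:
         candidate = current + bytes([byte])
         if candidate in dictionary:
             current = candidate                # keep extending the current match
         else:
             out.append(dictionary[current])    # emit the longest known string
             dictionary[candidate] = next_code  # ...and learn the one-longer string
             next_code += 1
             current = bytes([byte])
     if current:
         out.append(dictionary[current])
     return out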
Decoding symbols/indices to uncompressed data
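And a matching sketch of the decoding side. It rebuilds the same dictionary one step behind the encoder, which is why there is one special case: a code not yet in the dictionary must be the previous string plus its own first byte. (lzw_encode refers to the sketch above.)
 def lzw_decode(codes: list[int]) -> bytes:
     if not codes:
         return b""
     dictionary = {i: bytes([i]) for i in range(256)}
     next_code = 256
     it = iter(codes)
     previous = dictionary[next(it)]
     out = [previous]
     for code in it:
         if code in dictionary:
             entry = dictionary[code]
         elif code == next_code:                  # the 'not yet defined' special case
             entry = previous + previous[:1]
         else:
             raise ValueError("invalid LZW code")
         out.append(entry)
         dictionary[next_code] = previous + entry[:1]   # mirror the encoder's new entry
         next_code += 1
         previous = entry
     return b"".join(out)
 assert lzw_decode(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT")) == b"TOBEORNOTTOBEORTOBEORNOT"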
Between symbols and bitstream
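A much-simplified sketch of the packing step: fixed 12-bit codes, LSB-first, so it assumes the dictionary never grows past 4096 entries. Real implementations (GIF, compress) instead start at 9 bits and grow the width as the dictionary grows, with the decoder mirroring that decision.
 def pack_codes_12bit(codes: list[int]) -> bytes:
     out = bytearray()
     bitbuf = nbits = 0
     for code in codes:
         assert code < 4096                 # simplification: fixed 12-bit width
         bitbuf |= code << nbits            # append the code above the pending bits
         nbits += 12
         while nbits >= 8:
             out.append(bitbuf & 0xFF)      # flush whole bytes, low bits first
             bitbuf >>= 8
             nbits -= 8
     if nbits:
         out.append(bitbuf & 0xFF)          # final partial byte, zero-padded
     return bytes(out)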
Other notes
See also (LZW)
- http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
- http://warp.povusers.org/EfficientLZW/
- Web search: 'Tips for the LZW decoding algorithm'
Specific implementations
In GIF
Unsorted
Low memory footprint
Bi-level images
JBIG2
- ITU T.88, Information technology - Coded representation of picture and audio information - Lossy/lossless coding of bi-level images
See also:
Note that DjVu's CJB2 is a variation of this format.
CCITT Group 3 and 4
(Note CCITT is now ITU-T)
Primarily:
- ITU-T T.4, Standardization of Group 3 facsimile terminals for document transmission
- ITU-T T.6, Facsimile coding schemes and coding control functions for Group 4 facsimile apparatus
Published in 1988; parts of the fax standard were updated in 1992 and 1993. Actual fax machines need to conform to a larger number of standards.
See also
containers and/or compression
Pack200
Pack200 is a compression scheme for Java (specified in JSR 200), which seems to specialize in informed/logical compression of bytecode (and of bytecode within JARs).
Pack200 plus gzip compress these things better than gzip alone.
It's used with Java Web Start (JWS), among other things.
See also:
- http://www.jcp.org/en/jsr/detail?id=200
- http://docs.oracle.com/javase/1.5.0/docs/guide/deployment/deployment-guide/pack200.html