{{stub}}
 
For related compression methods, see [[Compression notes]]
 
=General=
 
Properties:
* relatively easy to implement.
:: If you're the learn-by-doing kind, you may like this one - you'll end up with a halfway decent compression algorithm and a decent understanding of how it works
:: Lots of little edge cases before it's production-ready, though.
 
* on compression level:
:: for readily compressible things, LZW is pretty decent - repetitive text like logs may become ~10% of original, general text perhaps 40%
:: in general it's not as good as other common and still relatively simple algorithms (like zlib)
:: incompressible data in LZW-coded form may become roughly 110%..130% of the original size (varying with the encoder's response to dictionary growth)
 
 
* memory overhead is bounded, and small to moderate (...for a codebook-style compression algorithm)
:: Beyond the input/output (which can be streamed), there's only the dictionary
:: the number of entries in the dictionary is bounded by agreement on the maximum code width. For the somewhat standard 12-bit maximum it is fairly small, and even, say, 20-bit dictionaries are often manageable.
:: the maximum size of the entries is effectively bounded by the way the dictionary is built up. It's not particularly minimized, though there are more-efficient-than-naive implementations
 
 
===Concepts, terms, and some implementation/behaviour notes===
{{stub}}
 
: '''Glossary'''
 
'''bit width''' - the number of bits currently needed to address every index in the table, which is ''ceil(log<sub>2</sub>(dictionary_size))'' (see the small example after this glossary)

'''code width''' refers to the number of bits needed to code entries from the dictionary, which depends (only) on the (current) number of entries.
: In practice the maximum size and code width are often decided before you start.
 
'''code size''' can be a bit ambiguous, often referring to the code width, or sometimes the dictionary size.
 
'''Dictionary''' (a.k.a. '''codebook''', '''code table''', '''translation table''') refers to a lookup table between uncompressed strings of data and the symbols (a.k.a. codepoints) they are assigned.
 
The '''dictionary size''' refers to the number of entries in the dictionary.
 
'''Index''' usually refers to a position within the dictionary (i.e. to the string of data stored at that position). The compressed data consists of a list of such indices.
 
'''Symbol''' usually refers to the indices. Below, symbols and indices are synonymous.  (In a formal-language-theory context it could also refer to the elements of the uncompressed data. Sometimes 'elements' is used to refer less ambiguously to units of the uncompressed input.)
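
As a quick way to get a feel for these numbers, here is a tiny, purely illustrative Python helper for the formula above (not from any particular implementation):

 import math
 def code_width(dictionary_size):
     # bits needed to address every index 0..dictionary_size-1
     return max(1, math.ceil(math.log2(dictionary_size)))
 # e.g. 256 entries -> 8 bits; 258 entries (256 literals + 2 control codes) -> 9 bits;
 # 700 entries -> 10 bits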
 
<!--
 
A complete system consists roughly of:
* Compression: Uncompressed input elements (usually bytes) &rarr; symbols &rarr; bits in a compressed bitstream
* Decompression: Bits in a compressed bitstream &rarr; symbols &rarr; uncompressed output
 
The LZW algorithm primarily describes the transition between uncompressed data and symbols.

Going between symbols and bit-packed data can be considered a separate implementation detail,
though in implementations it is often entangled with it, as that allows for some cleverness and optimization.
 
 
 
: '''LZW variations'''
 
There is no single LZW implementation.
 
There are a few major design decisions (meaning there are clever variations incompatible with other forms),
and specific-case assumptions.
 
They include:
* The initial state (of encoder and matching decoder), which consists of:
** the minimum/initial code width {{comment|(actually a slightly fuzzy concept. If you make (2<sup>8</sup>=) 256 entries for 8-bit input ''and'' have control codes, then your code width ''in practice'' will always be at least 9)}}
** dictionary entries for all possible input values. Often directly implied by the minimum code width. In general-purpose-byte-data coders this is 256, for specific purpose coders it can be smaller, sometimes larger.
** dictionary entries for the control codes your implementation uses (if any)
** the maximum code width, because it's usually a good idea to put a bound on the dictionary size (implementation decision, see notes below)
 
* Method (and variations), including:
** response to control codes
** dictionary bounding method
** what the dictionary value elements are -- usually bytestrings, but e.g. GIF codes the color indices it contains, so in a 16-color GIF they would be strings of 4-bit values
 
 
These decisions/assumptions must match between encoder and decoder (often fixed by a standard, specific implementation, or de facto convention), which means distinct implementations often cannot interpret data from other implementations.
 
 
Sometimes a few aspects may be communicated instead of predefined (e.g. GIF's minimum code size, because the data's bit-width may be smaller than a byte, for few-color images).
 
 
 
Encoders can do things that affect the encoded stream but ''not'' the decoded stream.
 
For example, variable-width LZW compressors that support clear codes may choose to emit one after a bunch of incompressible data, to bring the bit width back down and/or make the dictionary a better fit for the next chunk of data.
(you can even do some fancy analysis to get slightly better compression)
 
 
 
: '''Dictionary growth and bounding'''
 
The dictionary grows while compressing (and decompressing).

As more entries are added to the dictionary, it will occasionally cross a power-of-two size, which means the code width increases by one.
 
 
Because the compressed stream is a list of symbols coded at (in variable-bit-width coders) the current bit width, how applicable the dictionary entries are plays a role in coding efficiency.

If the dictionary grows without the new entries being used (which can happen when random-looking data passes through), the compression ratio will worsen.
 
 
The way entries are added makes them potentially useful on average, but not optimal.
As the number of dictionary entries increases, so does the number of entries that won't be useful to the compression, or used at all.

Because of this, there is often an upper bound placed on the dictionary size.
Once that limit is reached, there are a few different things you can do (a small sketch of the reset option follows this list):
 
* Perhaps the simplest option is to not alter the dictionary anymore, but this won't adapt to new patterns
** A bad idea for long inputs (with different sorts of data at different points), which will all be coded with codes built from a small portion at the start
 
* Also simple is to reset the dictionary to its initial state, effectively starting over
** Sub-optimal in that the compression ratio immediately after a reset will be lowered
** Decent for long unpredictable inputs, since it will adapt to new patterns, effectively coding them as independent parts
 
* Something else reproduced on both sides, such as continuously replacing the least-visited entry in the codebook
** Handy in that it will refine the quality of the codebook, removing unused and little-used entries (still not optimal, though)
** not exclusive with the clear code - the encoder can try to be very clever to see what combinations give good compression
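
As a rough Python sketch of the reset option (assuming byte input, a CLEAR code at 256, an EOI code at 257, and a 12-bit maximum; the names and the helper itself are illustrative only, not from any particular implementation):

CLEAR, EOI, MAX_ENTRIES = 256, 257, 1 << 12

def reset_if_full(dictionary, next_code, emit_symbol):
    # Called from inside an encoder loop: if the dictionary is full,
    # emit a CLEAR so the decoder resets too, then start over.
    if next_code >= MAX_ENTRIES:
        emit_symbol(CLEAR)
        dictionary.clear()
        dictionary.update({bytes([i]): i for i in range(256)})
        next_code = 258            # 256 and 257 stay reserved for CLEAR/EOI
    return next_code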
 
 
 
: '''Separation of algorithm and bitpacking'''
 
You might like separation in function, and therefore try to separate the symbol generation step from the step that packs the resulting bits into bytes (in the encoder) and out of bytes (in the decoder).
Yes, you'd be buffering data that you could in theory stream out, but that's a smallish price most of the time.
 
Two major factors in whether and where you can do this are dictionary bounding and the fixed/variable bitpacking choice.
 
Variable-width coding means you can't do this in the decoder, because it reads chunks whose size is determined by the current dictionary state.
You can still do it in the encoder, ''if'' it somehow marks where the width changes happen.
 
If you support clear codes or any other dictionary bounding that invalidates dictionary entries (anything other than "keep as-is"), you cannot separate the steps in the encoder either, because the bit sequences you map to depend on the ''current'' state of the dictionary (whereas without this, just the final state would be enough).
 
 
So for a general purpose LZW implementation you cannot separate the two - although with generator-style code (if possible) you can get a decent enough separation while streaming.
-->
 
===Encoding input data to symbols/indices===
<!--
 
Start with a dictionary that has one entry for each possible input value.
For byte input (very common) that means 256 entries.
For some specific cases/implementations it may differ; for example, for 16-color GIF images it would mean 16 entries (and an initial code width of 4).

In many real-world implementations there may also be some more initial codes, often control codes, most commonly two: a clear code and an end-of-input code (details later).
 
 
The encoding procedure, repeated while there is still data in the input (a code sketch follows the worked example below):
* while the current string is in the dictionary (and we are not at end of the input), read and add the next input element
** you now have the symbol for the longest matching string in the dictionary, and (except at end of input) the next character in the input
* emit the symbol into the symbol stream
* (in all steps but the last) Add a new entry to the dictionary, namely the concatenation of:
** the string that we just emitted the symbol for
** the next input element that we read earlier
 
 
For example, assume our input is <tt>ababababa</tt> - and for brevity of the dictionary in the example, that the input can only contain <tt>a</tt> and <tt>b</tt>.
 
The initial dictionary (assuming our system uses control codes) might look like:
0 a
1 b
2 clear code
3 EOI code (a.k.a. stop code)
 
After five iterations...
*  a is the longest match, b is next. Emit 0, add ab (becomes 4)
*  b is the longest match, a is next. Emit 1, add ba (becomes 5)
*  ab is the longest match, a is next. Emit 4, add aba (becomes 6)
* aba is the longest match, b is next. Emit 6, add abab (becomes 7)
*  ba is the longest match, and there is no more data. Emit 5.
 
...we have:
output stream: 0 1  4  6  5
(representing) a b ab aba ba
and the dictionary contains:
0 a
1 b
2 clear code
3 EOI code
4 ab
5 ba
6 aba
7 abab
 
Notes:
* our output symbol list is a list of integers
** which given our current codebook size could fit within a 3-bit code width (but more on bit coding later).
* entry 7 was never used. This happens regularly
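
The above as a minimal Python sketch (illustrative only, not any particular implementation's code). It assumes byte input, literal entries 0..255, control codes CLEAR=256 and EOI=257, a 12-bit maximum, and the simplest bounding choice of just not adding entries once full:

CLEAR, EOI = 256, 257
MAX_ENTRIES = 1 << 12

def lzw_encode(data):
    # bytes in, list of integer symbols out
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 258                      # 256 and 257 are reserved
    symbols = [CLEAR]                    # many formats start with a clear code
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate          # keep extending the match
        else:
            symbols.append(dictionary[current])   # longest match found
            if next_code < MAX_ENTRIES:  # add longest match + next element
                dictionary[candidate] = next_code
                next_code += 1
            current = bytes([value])
    if current:
        symbols.append(dictionary[current])
    symbols.append(EOI)
    return symbols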
-->
 
 
===Decoding symbols/indices to uncompressed data===
<!--
Given matching initial state (assumptions) and method, the decoder can reconstruct both the dictionary {{comment|(because entries are all concatenations of existing entries)}} and, since it will have the same entries at the relevant time, also the data.
 
Emulating the encoder's buildup of the dictionary is conceptually slightly more complex than the steps in the encoder, largely because the decoder is slightly behind in entries.
 
The decoding procedure (a code sketch follows the worked example below): {{comment|(''o'' is short for old, also often named prefix)}}
* read symbol into ''o''
* output dictionary's data for symbol ''o''
* While there are still input symbols:
** read symbol into ''n''
** if ''n'' is currently in the dictionary:
*** output dictionary's data for symbol ''n''
*** add new dictionary item: concatenation of ''o'' and ''n''[0]
** if ''n'' is not currently in the dictionary: (we figure out what the encoder must have added{{verify}})
*** add new dictionary item: concatenation of ''o'' and ''o''[0]
*** output dictionary's data for the just-added entry
** o=n
 
 
Continuing the example above, there are five inputs to handle:
* 0 - Necessarily present. Emit a
* 1 - Present. Emit b,  add  ab  (comes from a+b, becomes 4) (o now 1, 'b')
* 4 - Present. Emit ab, add  ba  (comes from b+a, becomes 5) (o now 4, 'ab')
* 6 - Not present. Add and emit what it would be: aba  (comes from ab+a, becomes 6) (o now 6, 'aba')
* 5 - Present. Emit ba, Add abab (comes from aba+b, becomes 7)
...which reproduces the original data, and the same dictionary as the encoder built.
 
 
Note:
* The control codes were there partly to point out that they're not in the way - you do have to be consistent between encoder and decoder.
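
The same as a minimal Python sketch, mirroring the encoder sketch in the encoding section (same assumptions: byte input, CLEAR=256, EOI=257, 12-bit maximum; purely illustrative):

CLEAR, EOI = 256, 257
MAX_ENTRIES = 1 << 12

def lzw_decode(symbols):
    # list of integer symbols in, bytes out
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 258
    output = bytearray()
    previous = None                      # 'o' in the procedure above
    for symbol in symbols:
        if symbol == EOI:
            break
        if symbol == CLEAR:
            dictionary = {i: bytes([i]) for i in range(256)}
            next_code = 258
            previous = None
            continue
        if symbol in dictionary:
            entry = dictionary[symbol]
        else:                            # the "not yet present" case: o + o[0]
            entry = previous + previous[:1]
        output += entry
        if previous is not None and next_code < MAX_ENTRIES:
            dictionary[next_code] = previous + entry[:1]
            next_code += 1
        previous = entry
    return bytes(output)

With these two sketches, lzw_decode(lzw_encode(data)) == data should hold for any bytes input.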
-->
 
===Between symbols and bitstream===
<!--
The basic algorithm only says something like 'output the symbol/index to the compressed stream', not how.
 
Classroom demonstrations may just show the symbol/index stream as-is since they're primarily demonstrating the algorithm,
but since we're talking about compression we may as well try to do it efficiently.
 
 
The simplest decent method is to choose the smallest bit width that can code every entry of the dictionary (''ceil(log<sub>2</sub>(dictionary_size))''), and use bit packing to pack that into a resulting byte stream.
For example, if the dictionary grew to 700 entries, we need 10 bits to represent each symbol/index. We can pack the bits into bytes (every 8 codes fit into 10 bytes) and be done with it.
 
This can be called '''fixed-width bit packing'''. Note that the encoder cannot start producing that stream until it knows the dictionary will no longer grow (usually meaning 'when it's done with all the input'{{verify}}), and the decoder will need a mechanism to know that width before it starts decoding.
 
 
 
Now, for an 8-bit-minimum, 12-bit-maximum setup, the first few hundred input items could still be coded with 8 or 9 bits - but for inputs of a few kilobytes, fixed-width coding is already likely to settle on 11 or 12 bits.
 
The idea of '''variable-width bit packing''' (a.k.a. ''dynamic LZW'' and various other names) is that you could just use the ''current'' width, as long as encoder and decoder agree on what that is -- and they do, because the dictionary will be the same at the same point in the stream.
 
...with one difference: the decoder is always one dictionary entry behind, which means that in decoding you need to test whether the ''next'' dictionary add (known to be about to happen) would grow the dictionary's bit width.
 
Note that with variable-width coding the stream starts at the agreed initial code width, meaning the encoder does not have to buffer all the input before it can start writing the compressed stream, and doesn't have to communicate the bit width to the decoder (see the packing sketch at the end of this section).
 
 
Notes:
* Doing all this bit packing and unpacking, particularly doing it efficiently, takes a little care.
 
* Different LZW implementations use different bit order within bytes{{verify}}. For example, GIF is LSB, TIFF is MSB.
 
* A system supporting clear codes can keep in mind that its use means resetting to shorter-width codes. How and when this would be good for compression is usually not so easy to prove or calculate, but it may well be a good idea.
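
As a rough Python sketch of the packing side only, assuming the symbols and their per-symbol widths are already known (which sidesteps the encoder/decoder coupling discussed earlier); LSB-first, the bit order GIF uses:

def pack_lsb_first(symbols_and_widths):
    # symbols_and_widths: iterable of (symbol, width) pairs
    out = bytearray()
    accumulator = 0          # bits collected so far; low bits are oldest
    bit_count = 0
    for symbol, width in symbols_and_widths:
        accumulator |= symbol << bit_count
        bit_count += width
        while bit_count >= 8:
            out.append(accumulator & 0xFF)
            accumulator >>= 8
            bit_count -= 8
    if bit_count:            # flush the final partial byte, zero-padded
        out.append(accumulator & 0xFF)
    return bytes(out)

# fixed-width example: ten-bit codes throughout
# packed = pack_lsb_first((s, 10) for s in symbols)

For variable-width packing the caller would derive each width from the dictionary size at the moment the symbol was emitted; for fixed-width packing every width is the same.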
 
-->
 
 
 
===Other notes===
<!--
 
: '''Dictionary optimizations'''
 
Dictionary entries always build on existing dictionary entries.
This suggests you can save memory by storing entries in a data structure that implies the concatenations, rather than storing each of them in full.
 
More interesting is creating a structure that allows the algorithm to do the searches it needs efficiently.
 
Depending on the intended balance between speed and memory conservation, you could use a tree, a trie, or various other structures.
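
One minimal Python sketch of the memory-saving idea, keying encoder entries on (prefix code, next element) rather than on the full string (byte input assumed, as elsewhere on this page; the exact keying is just one possible choice):

def make_initial_dictionary():
    # literal entries: prefix code -1 stands for "empty string"
    return {(-1, value): value for value in range(256)}

# During encoding, if current_code is the code for the longest match so far
# and next_value is the next input element:
#     key = (current_code, next_value)
#     if key in dictionary: extend the match (current_code = dictionary[key])
#     else: emit current_code and add dictionary[key] = next_free_code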
 
 
 
: '''End of input'''
 
Various LZW implementations have a special code called end of input (EOI), for one of a few reasons:
* Make it explicit when a compressed stream ends - without this it is only implicit, detected by the fact that the input data doesn't have enough ''bits'' left to produce another value.
 
* Paging, which seems to refer to coding multiple distinct chunks like ''CLEAR, chunk's contents, EOI'' (...and finishing when there is no more data)  {{verify}}
 
 
 
: '''Initial dictionary size, input value use'''
 
Consider an 8-color paletted image, in memory using only byte values 0..7. GIF would take that and create an initial dictionary with eight colors, a clear code and EOI code. That's 10 entries, starting off with 4bpp coding.
 
It does mean one extra byte needs to be sent, to imply the number of initial entries, but this is negligible overhead and worth it for simple images:
 
The difference between the 8-color case and treating the input as bytes is 248 initial entries you would never use, which can now be used for compression codes. In variable-width bit packing you also spend a while at smaller code widths before even reaching the 8-bit or 9-bit (with control codes) width you would have started at with byte input.
 
 
 
And if you don't mind creating yet another LZW variant then something similar goes for text.
For example, English text in ASCII regularly only uses byte values within 10..126. You could send both of those values along (spending 2 extra bytes) and have both sides create 117 initial entries in the dictionary. This leaves a bunch more dictionary entries that can be used for compression codes, and means some time is spent coding the input with 8-bit instead of 9-bit codes.
 
Or you could write a bitmask to signal which of the actual byte values are used - which is a more generic solution and for some input might be slightly more effective - though it does mean (256/8=)32 extra bytes to send.
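
As a rough Python sketch of that bitmask idea (purely illustrative; no real format does exactly this):

def used_value_bitmask(data):
    # 32 bytes = 256 bits, one per possible byte value
    mask = bytearray(32)
    for value in set(data):
        mask[value // 8] |= 1 << (value % 8)
    return bytes(mask)

def initial_dictionary_from_mask(mask):
    values = [v for v in range(256) if mask[v // 8] & (1 << (v % 8))]
    # literal entries get codes 0..len(values)-1; control codes would follow
    return {bytes([v]): code for code, v in enumerate(values)}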
-->
 
 
 
===See also===
* http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
* http://warp.povusers.org/EfficientLZW/
* [http://www.google.com/search?q=tips%20for%20the%20lzw%20decoding%20algorithm Web search: 'Tips for the LZW decoding algorithm']
 
=Specific implementations=
===In GIF===
<!--
 
GIF image data is
* initial code width (in a single byte. Some details to this; see notes below)
* packed bits-in-bytestream (wrapped in the 255-byte-at-most-at-a-time blocks that GIF uses in general)
 
LZW-specific details
* variable-width bit packing
* little-endian bit order
* max code width is 12      {{comment|(presumably minding the memory constraints of the time)}}
* initial state of dictionary is:
** 2**initialcodewidth entries
** CLEAR code
** EOI code (used to mark the end of this data)
** Note: Because of those two control codes, the code width increases by one before any data is handled, and the first compression code is 2+2**initialcodewidth
* Stream is started with a clear code
 
 
: '''Historically'''
 
Using the LZW algorithm between roughly 1984 and 2004 could violate the patent on it,
but it seems that (LZW being a specific member of the LZ family) you could decompress without such trouble,
and you could fairly easily produce valid LZW-coded data without using the patented compression method - it just doesn't actually compress the data - so you could avoid patent problems by writing uncompressed GIFs.
 
A number of libraries and programs still do this (though it wouldn't be very hard to add a real LZW encoder/decoder now).
 
 
The compression step consists of:
* decide the code size (usually the same as the number of color bits, so up to 8; it can't be 1 because of LZW details, so 1bpp images must use code size 2; see below)
* compress the image pixels to compression codes
* the compression codes form a bitstream of variable-width codes; pack it into bytes
* write into the file
** write the code size
** wrap the bytes into blocks of at most 255 bytes, preceded by the block's length
 
Uncompressed GIFs use only the codes that map to themselves (the literal entries).
The encoder also inserts an occasional clear code so that the code width doesn't increase:
you'll need 9 bits per pixel to code an 8bpp GIF - which means the 'uncompressed' data is actually larger than the pixel data it comes from - but without clear codes it would widen to 10, 11, or 12.
 
How often there ''needs'' to be a clear code depends on the code size (one every 2<sup>codesize</sup>-2 codes).
Two-color (1bpp) images are the worst case: GIF's code size cannot be 1, so they must use code size 2, which (with the control codes) means 3-bit codes, ''and'' a clear code is needed every couple of codes to keep the width from growing to 4 bits - making the result several times the size of the raw 1-bit pixel data.
 
For 8bpp it's 9 bit width and a clear code every 254 codes, which means it's about ~13% larger than the original data.
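
As a rough Python sketch of the uncompressed-GIF idea for 8bpp data (symbol values as in GIF; the exact point at which decoders widen codes varies slightly between implementations, so the 254 here is the conservative figure used above):

def uncompressed_gif_symbols(pixels):
    # pixels: iterable of values 0..255; returns the symbol stream,
    # which would then be packed LSB-first at 9 bits per symbol and
    # wrapped in GIF's 255-byte sub-blocks
    CLEAR, EOI = 256, 257
    symbols = [CLEAR]
    since_clear = 0
    for pixel in pixels:
        if since_clear == 254:       # keep the decoder's code width at 9 bits
            symbols.append(CLEAR)
            since_clear = 0
        symbols.append(pixel)        # every pixel as its own literal code
        since_clear += 1
    symbols.append(EOI)
    return symbols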
 
 
 
-->
