Image file format notes

From Helpful
Revision as of 14:01, 10 June 2019 by Helpful (Talk | contribs)

Notes related to processing the file structure or contents of image, sound, or video.


These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Method / limitations:

  • JPEG basically stores DCT coefficients, so it is best for things that resemble gradients, which is why it's better at photos than at line art, text, and other sharp edges
Compression comes mainly from quantizing those coefficients and Huffman-coding them (some more from some image-aware cleverness)



Notes on JPEG file structure

At one point I wanted to identify whether a JPEG was lossless or not, so wanted to parse the basic file structure (without decoding the image).

A JPEG file is a series of segments.

All segments start with FF and a single marker byte.

A few markers imply a fixed-structure, fixed-size segment, including:

  • D8 - Start Of Image (SOI). No data follows.
Probably all JPEG files start with the SOI
  • D9 - End Of Image (EOI). No data follows.
Some JPEGs have been seen without EOI, but typically it's there.
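  • D0 through D7 - restarts (RST0 through RST7). No data follows.
  • DD - sets restart interval (DRI). This one does carry a size field, which is always 4: the two size bytes, then a two-byte interval.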

Most other segments mention their size, and the structure is:

  • 0xFF (1 byte)
  • marker (1 byte)
  • datasize (2 bytes, big-endian)
...which refers to the whole payload, including these two size-indicating bytes
  • data (byte length as indicated by datasize, minus two)
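The segment layout above can be walked without decoding any image data. The following is a minimal sketch, not a complete parser: the function name `iter_jpeg_segments` is made up for illustration, it assumes the whole file is in memory, it does not handle fill bytes, and it stops at Start Of Scan because the entropy-coded data after that has no size field.

```python
import struct

# Markers that stand alone: no size field and no payload follow them.
STANDALONE = {0xD8, 0xD9} | set(range(0xD0, 0xD8))

def iter_jpeg_segments(data: bytes):
    """Yield (marker, payload) for each segment, up to and including SOS."""
    pos = 0
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected 0xFF at offset {pos}")
        marker = data[pos + 1]
        pos += 2
        if marker in STANDALONE:
            yield marker, b""
            if marker == 0xD9:  # End Of Image
                return
            continue
        # Big-endian size, which counts its own two bytes.
        (size,) = struct.unpack(">H", data[pos:pos + 2])
        yield marker, data[pos + 2:pos + size]
        pos += size
        if marker == 0xDA:
            # Compressed image data follows Start Of Scan without a size
            # field, so this sketch stops rather than scan for the next marker.
            return
```

Usage would be something like `for marker, payload in iter_jpeg_segments(open(path, "rb").read())`, printing `hex(marker)` and `len(payload)` to get an overview of a file.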

Some common markers:

  • 0xe0 - APP0, most typically used to store some generic file metadata (e.g. JFIF version)
  • 0xe1 - APP1, most typically used to store EXIF metadata
  • 0xc0 - start of frame for baseline JPEG
  • 0xc2 - start of frame for progressive JPEG
  • 0xc4 - huffman tables
  • 0xdb - quantization tables
  • 0xfe - comment
  • 0xda - start of scan

Some specific notes:

  • Start Of Frame:
These also mention image size, channels, bits per channel
There were originally two Start Of Frame choices. While those are still the most common, there now seem to be approximately nine. (verify)
Typical JFIF files store 8 bits per channel, as YCbCr (or a single Y channel for grayscale), which decoders then convert to RGB
There are 12-bit JPEGs, part of the spec but only supported by some specialist software (medical applications).
You can omit the chrominance data for a grayscale JPG (verify)
you can use a paletted mode
  • APP0
Some webpages suggest the JPEG header is always the SOI followed by APP0 (i.e. FF D8 FF E0, you may recognize it as ÿØÿà). JFIF does demand this (in part because APP0 is the only way to identify a JFIF file in the first place), both its presence and position, so probably 99%+ of JPEGs have an APP0 there. File-structure-wise it is optional, so a truly generic parser ought to just parse the segments as they appear.
  • quantization tables (0xdb)
Most JPEG compressors store two tables, one for Y, and one for both Cr and Cb.
Digital cameras are more likely to store three.
each segment may contain one or more quantization tables. Typically all tables are in the same segment, but there are cases where they are stored separately.
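As a sketch of the Start Of Frame note above: the payload starts with the sample precision, height, width, and component count, followed by three bytes per component. The function name `parse_sof` and the dictionary keys are my own choices for illustration; the input is assumed to be the segment payload without the marker and size bytes.

```python
import struct

def parse_sof(payload: bytes) -> dict:
    """Parse a Start Of Frame payload (the bytes after the 2-byte size)."""
    precision, height, width, ncomp = struct.unpack(">BHHB", payload[:6])
    components = []
    for i in range(ncomp):
        start = 6 + 3 * i
        comp_id, sampling, qtable = struct.unpack(">BBB", payload[start:start + 3])
        components.append({
            "id": comp_id,
            "h_sampling": sampling >> 4,    # high nibble
            "v_sampling": sampling & 0x0F,  # low nibble
            "qtable": qtable,               # which quantization table it uses
        })
    return {
        "bits_per_channel": precision,
        "width": width,
        "height": height,
        "components": components,
    }
```

A grayscale file would show up here as a single component, and the `qtable` values show which components share a quantization table.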


JPEG originally only referred to a compression method, and file container format.

It did not fully specify how to treat that data as an image, though. For this and a few other reasons, JFIF (JPEG File Interchange Format) was created as a better-defined standard that programs could more easily support.

What we now call JPEG files are basically always JFIFs, because that's what anyone sane would want.


PNG is a lossless compression format that compresses both sharp-contrast and photographic images decently.

If you don't need animation, this is generally preferable to GIF, as it's true-color, lossless, and has more advanced transparency.

All web browsers now support it (IE was the last to solve a list of related bugs, but has pretty decent support since IE7), and operating systems have fairly widely accepted it now (e.g. for icons).

On quality and size:

  • for diagram-style images, compression is comparable with GIF, and is better quality than JPEG
  • for gradient/photographic images its compression compares to high-quality-settings JPEG -- sometimes larger than JPEG, because JPEG can fudge over fine noise, while PNG necessarily preserves it.
For web content, smaller size can be more important than quality, which is a tradeoff you can't make with PNG, and you'll probably still want JPEG.

Method / limitations:

  • lossless compression
  • paletted, greyscale, or RGB
  • No animation in the standard (see APNG, also MNG, but they are not widely supported and probably won't be anytime soon)
  • Many specific formats, with different features/support in older browsers (see e.g. [1])

Compressors include

Lossy (smart palette quantization)


PNG structure notes


File structure is broadly:

  • Eight-byte header: 137 80 78 71 13 10 26 10
  • an IHDR chunk
  • content chunks
  • an IEND chunk

A chunk is:

  • 4-byte uint - data field's length (of payload, so excluding length, type, and crc fields)
  • 4-byte uint - type
  • data bytes
  • 4-byte CRC

On types:

Chunk types are mostly text (four ASCII letters), but use some specific bits to convey further information,
including whether the chunk is critical or ancillary.
Types allow extension of the format while staying compatible with older readers and/or editors, which follow rules about known / unknown, safe-to-copy / unsafe-to-copy, and critical / ancillary chunks.
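The header and chunk layout described above is simple enough to walk in a few lines. This is a sketch under assumptions: `iter_png_chunks` is a made-up name, the whole file is assumed to be in memory, and only the critical/ancillary bit of the type is looked at. The CRC is computed over the type and data fields (not the length).

```python
import struct
import zlib

PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])

def iter_png_chunks(data: bytes):
    """Yield (type, payload, is_ancillary, crc_ok) for each chunk."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        crc_ok = zlib.crc32(ctype + body) == crc
        # Bit 5 of the first type byte: set (lowercase letter) means ancillary.
        ancillary = bool(ctype[0] & 0x20)
        yield ctype.decode("ascii"), body, ancillary, crc_ok
        pos += 12 + length  # length + type + data + crc
        if ctype == b"IEND":
            return
```

Because unknown chunks can be skipped by length alone, a reader like this does not need to understand a chunk to step over it.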


Animated PNG is an unofficial extension to PNG, which produces images backwards compatible with PNG decoders that don't know about APNG (they will show the first frame).

APNG is supported by a number of browsers and some image editors, but not widely enough to use it professionally.


MNG (Multiple-image Network Graphics) is a close relation of PNG (written by the same team).

It allows animation, which was meant to let it replace GIF on the web and in other areas.

Not as widely adopted as PNG. Though a number of things have adopted it without us really noticing, web browsers generally haven't.


TIFF has grown over time in terms of what it can store. Version 6 of the TIFF specs (~1992, and the document encompasses previous versions), the version most things today use, allows things like varied bit depth, color spaces, compression, making it a useful lossless format.

TIFF is a container format, and a bunch of specific tags detailing the format of image blocks.

TIFF files have blocks of metadata (IFDs) storing flexible sets of tag-value pairs, and pointers to data within the file. This allows for one or more images (e.g. stored thumbnail previews), varied metadata, and lets you effectively tell simpler decoders "just ignore the bits you can't parse"

Notes on TIFF file structure

A TIFF file starts with a header, then an initial IFD (basically 'a block of metadata').

The 8-byte header is

  • 2 bytes indicating the endianness used in the header and IFDs
namely "II" (0x49 0x49) for little-endian ('Intel') and "MM" (0x4D 0x4D) for big-endian (Motorola)
  • 2 bytes: magic number
namely the integer 42
  • int32: absolute offset of the first IFD

IFDs are:

  • int16: number of entries in this IFD
  • each entry is 12 bytes
    • 2-byte tag (the thing TIFF is named for), see also TIFF tag reference
    • 2-byte type
    • 4 bytes: number of values
    • int32: absolute offset to value -- or, when all the values together take <=4 bytes, the data itself
  • int32: absolute offset of the next IFD (0 if this is the last one)
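A sketch of reading the header and following the IFD chain, assuming the whole file is in memory. The name `read_tiff_ifds` is made up, and the last four bytes of each entry are left as raw bytes, since whether they are a value or an offset depends on the type and count.

```python
import struct

def read_tiff_ifds(data: bytes):
    """Return a list of IFDs, each a list of (tag, type, count, raw4) entries."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    ifds = []
    seen = set()  # guards against a chain that loops back on itself
    while offset != 0 and offset not in seen:
        seen.add(offset)
        (count,) = struct.unpack(order + "H", data[offset:offset + 2])
        entries = []
        for i in range(count):
            start = offset + 2 + 12 * i
            tag, typ, nvals = struct.unpack(order + "HHI", data[start:start + 8])
            raw4 = data[start + 8:start + 12]  # the value itself, or an offset to it
            entries.append((tag, typ, nvals, raw4))
        ifds.append(entries)
        end = offset + 2 + 12 * count
        (offset,) = struct.unpack(order + "I", data[end:end + 4])
    return ifds
```

This only follows the next-IFD chain; offsets stored inside tag values (sub-IFDs, private IFDs) would need tag-specific handling.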

IFDs often point to image data.

They can also point to further IFDs, by describing their byte positions. The ability to chain IFDs allows for multiple images and, well, anything else.

The simplest TIFFs have just one data block, with one IFD pointing to it and explaining just enough detail on how to interpret that data (pixel data, compression, etc).

Beyond an IFD chaining to the next IFD, you also can get properties that point to private IFDs or to non-TIFF structures - at which point it's a custom use of TIFF. (this is all perfectly valid because the offsets in the standard parts just don't point to said custom data blocks)

Digital cameras' raw files (Canon's CRW and CR2, Nikon's NEF, Pentax's PEF, Samsung's PEF, Sony's SR2 and ARW, Fuji's RAF) may e.g. embed EXIF, IPTC, XMP, GPS, have an IFD for a tiny thumbnail, an IFD and data for a medium thumbnail, and an IFD for the raw data which may not be easy to interpret, etc.

While most of these are based on TIFF structure, they are not expected to be opened in generic TIFF viewers/editors, so can deviate somewhat. (Some do not, strictly speaking, conform to TIFF6, e.g. in required-tag details.) Usually only the company that made it knows quite everything about it.

(...which is one reason for DNG, which is a more open standard (also based on TIFF))

Notes on compression



Method / limitations:

  • GIF is paletted, with a maximum of 256 colors, so best for line art and grayscale images, not for photo quality
    • (technically you can use many frames and many palette updates to render one image in more colors, but it's inefficient and not all decoders like doing this)
  • Two versions, GIF87a and GIF89a, the latter being the most interesting for showing animation, because it added delays
  • Uses LZW - which was patented in 1985, with the patents expiring in 2003 or 2004 (varying with country)
In theory you could skip compression, but this made for rather large files,
so this spurred development of PNG. PNG now seems the handier choice for most things, except animation (GIF is still the only widely supported image format that does this).

Notes on GIF file structure

Broad structure

For the structure and possible order of related chunks, see diagrams like that on [2] - can be interesting when writing a parser.

  • The GIF file header consists of:
    • The six-byte file header, either "GIF87a" or "GIF89a"
    • the logical screen descriptor (LSD), which is seven bytes, and which must be present, so is effectively part of the file header

  • A sequence of blocks, each being one of the following three:
    • an application extension (see below)
    • a comment extension
    • stuff that you can see, which itself can consist of:
      • An optional GCE (Graphics Control Extension, see below) before one of the following two:
        • Image Descriptor + optional local color table + image data
        • (optional) Plain Text Extension (rarely used)
  • A 0x3B byte (; character) as a GIF file trailer.

Notes on extension blocks:

  • The Netscape extension block is common for animated GIFs (though not required)
    • This extension may only be placed directly after the LSD
    • It only really contains the loop instruction
  • The Graphic Control Extension (GCE) is very common in animated GIFs, as it mentions each frame's duration (and also its transparency)
  • Comment extension - used for attribution, 'created by this-and-that software', and such
  • Plain text extension mentions text to be rendered. Rarely used, and a number of decoders simply ignore it.


The Logical Screen Descriptor consists of:

  • 2 bytes: logical screen width (LSB-order 16-bit int)
  • 2 bytes: logical screen height (LSB-order 16-bit int)
  • 1 byte with bit packing:
    • 1 bit: whether the global color table is present (and follows). Most of the below is only relevant if it is.
    • 3 bits: color resolution - "Number of bits per primary color available to the original image". 000 means 1, 001 means 2, ..., 111 means 8
    • 1 bit: sort flag - whether the colors in the global color table are sorted in order of decreasing use, which can help the decoder
    • 3 bits: size of color table - which will have 2^(value+1) entries of three bytes (R, G, B) each
  • 1 byte: which color is the background color (only meaningful if there is a global color table)
  • 1 byte: aspect ratio. Most images use 0, meaning using the storage's ratio (square pixels). If nonzero, the spec says the ratio should be calculated like (byte-value + 15) / 64
  • optional global colour table, if mentioned by the LSD (almost every animated GIF has this)
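The bit packing is easier to follow in code. A minimal sketch of reading the file header, the LSD, and the global color table, assuming the file is in memory; `parse_gif_header` and the dictionary keys are names I made up for illustration.

```python
import struct

def parse_gif_header(data: bytes) -> dict:
    """Parse the 6-byte signature, the 7-byte LSD, and the global color table."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    width, height, packed, bg_index, aspect = struct.unpack("<HHBBB", data[6:13])
    has_gct = bool(packed & 0x80)                  # top bit
    color_resolution = ((packed >> 4) & 0x07) + 1  # 000 means 1 ... 111 means 8
    gct_sorted = bool(packed & 0x08)
    gct_entries = 2 ** ((packed & 0x07) + 1) if has_gct else 0
    gct = data[13:13 + 3 * gct_entries]            # three bytes (R, G, B) per entry
    return {
        "version": data[3:6].decode("ascii"),
        "width": width,
        "height": height,
        "color_resolution": color_resolution,
        "gct_sorted": gct_sorted,
        "background_index": bg_index if has_gct else None,
        # 0 means square pixels; otherwise (value + 15) / 64
        "pixel_aspect": (aspect + 15) / 64 if aspect else None,
        "palette": [tuple(gct[i:i + 3]) for i in range(0, len(gct), 3)],
        "next_offset": 13 + 3 * gct_entries,       # where the first block starts
    }
```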

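Extension blocks consist of:

  • ! character (0x21 byte)
  • Extension label, a single byte to distinguish between GCE, comment extension, plain text extension, application extension
  • data, as one or more sub-blocks that each start with a one-byte size (most extensions start with a fixed-size block; the comment extension has only free-form data sub-blocks)
  • 0x00 byte (block terminator)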
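A sketch of stepping over an extension block without interpreting it. It treats everything after the label as size-prefixed sub-blocks ending in a zero byte, which to my understanding of the GIF89a spec holds for all four extension types. The function names are made up for illustration.

```python
def read_sub_blocks(data: bytes, pos: int):
    """Read size-prefixed sub-blocks from pos up to the 0x00 terminator.

    Returns (joined payload, position just after the terminator).
    """
    chunks = []
    while True:
        size = data[pos]
        pos += 1
        if size == 0:
            return b"".join(chunks), pos
        chunks.append(data[pos:pos + size])
        pos += size

def read_extension(data: bytes, pos: int):
    """Read one extension block; returns (label, payload, next position)."""
    if data[pos] != 0x21:
        raise ValueError("not an extension block")
    label = data[pos + 1]  # 0xF9 GCE, 0xFE comment, 0x01 plain text, 0xFF application
    payload, end = read_sub_blocks(data, pos + 2)
    return label, payload, end
```

The same `read_sub_blocks` helper also fits the compressed image data, which uses the same size-prefixed sub-block scheme after its first byte.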

Image Descriptor + optional local color table + actual image data consists of:

  • optional Graphics Control Extension (GCE)
    • has transparent color? (and if so, which color index)
    • user input required? - usually ignored
    • duration, in hundredths of a second
    • disposal method (verify) (do nothing, replace with background color, replace with previous image)
  • image data block (OR plain text block, for text to be rendered, but these are generally not supported, and so ignored)
    • , character (0x2C byte) signifies start of Image Descriptor
    • 2 bytes: left position on logical screen (LSB-order 16-bit int)
    • 2 bytes: top position on logical screen (LSB-order 16-bit int)
    • 2 bytes: frame width (LSB-order 16-bit int)
    • 2 bytes: frame height (LSB-order 16-bit int)
    • 1 byte: packed
      • 1 bit: local color table present/follows (if not, the global one will be used)
      • 1 bit: interlaced?
      • 1 bit: sort flag (see notes on global color table)
      • 2 bits: reserved
      • 3 bits: size of local color table
    • optional local color table
    • image data
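The Image Descriptor fields can be unpacked as follows. This is a sketch: `parse_image_descriptor` is a made-up name, `pos` is assumed to point at the 0x2C byte, and the compressed image data itself is not touched.

```python
import struct

def parse_image_descriptor(data: bytes, pos: int) -> dict:
    """Parse the Image Descriptor that starts at pos (the 0x2C byte)."""
    if data[pos] != 0x2C:
        raise ValueError("not an image descriptor")
    left, top, width, height, packed = struct.unpack("<HHHHB", data[pos + 1:pos + 10])
    has_lct = bool(packed & 0x80)
    lct_entries = 2 ** ((packed & 0x07) + 1) if has_lct else 0
    lct_start = pos + 10
    lct = data[lct_start:lct_start + 3 * lct_entries]
    return {
        "left": left,
        "top": top,
        "width": width,
        "height": height,
        "interlaced": bool(packed & 0x40),
        "sorted": bool(packed & 0x20),
        "local_palette": [tuple(lct[i:i + 3]) for i in range(0, len(lct), 3)],
        # Image data starts here: one byte of LZW minimum code size,
        # then size-prefixed sub-blocks ending in a 0x00 byte.
        "data_offset": lct_start + 3 * lct_entries,
    }
```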


Other notes

(Animated) GIF rendering

The LSD defines a logical image. You can consider it the canvas that individual frames draw on. Individual frames need only draw onto part of it (optimized animated GIFs use that fact). Also, each frame may have its own distinct disposal method (specified by the GCE that is probably before it).

Because of transparency, background, frame disposal, and the partial-frame-update detail, animated GIFs should be rendered incrementally for them to display correctly. Viewers do this, but this can be a significant detail to programmatically reading (and efficiently writing) GIF files.

You can sometimes do without that, if all frames update all of the logical screen, and have no transparency (but that often means the GIF is larger than necessary).

Because most real-world GIF rendering is done onto a true color target, you can have a true-color GIF, by updating it in small frames (at worst 256 pixels at a time, each with their own palette) with zero delay. This is rarely done, because there's rarely any point.

Roughly speaking, rendering is done by:

  • Allocating the logical screen
  • For each frame:
    • update only non-transparent regions (each frame may have transparency and may have a unique color index for it)
      • ...using the local color table if present, or the global color table if not
    • dispose as specified by the GCE (usually "don't do anything", but not always)
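The steps above can be sketched as a compositing loop. Everything about the input shape is an assumption for illustration: frames are taken to be already LZW-decoded dictionaries with `left`, `top`, `width`, `height`, `pixels` (row-major palette indices), and optional `palette`, `transparent`, and `disposal` keys. Disposal values follow the GCE (2 = restore to background, 3 = restore to previous); filling with a fixed background color is a simplification.

```python
def render_frames(screen_w, screen_h, global_palette, frames, background=(0, 0, 0)):
    """Composite frames onto one canvas; return the canvas as shown per frame."""
    canvas = [background] * (screen_w * screen_h)
    rendered = []
    for f in frames:
        before = list(canvas)  # kept in case disposal is "restore to previous"
        palette = f.get("palette") or global_palette
        transparent = f.get("transparent")
        for y in range(f["height"]):
            for x in range(f["width"]):
                idx = f["pixels"][y * f["width"] + x]
                if idx == transparent:
                    continue  # leave whatever was already on the canvas
                canvas[(f["top"] + y) * screen_w + f["left"] + x] = palette[idx]
        rendered.append(list(canvas))
        disposal = f.get("disposal", 0)
        if disposal == 2:
            # restore the frame's area to the background color
            for y in range(f["height"]):
                for x in range(f["width"]):
                    canvas[(f["top"] + y) * screen_w + f["left"] + x] = background
        elif disposal == 3:
            canvas = before
    return rendered
```

Each entry in the returned list is a full true-color image of the logical screen, which is why extracting "frame N" from an optimized GIF needs all earlier frames too.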

Individual/overall palette

For an animated GIF, each frame can have its own optimized palette.

In theory, adaptive palette choice makes each individual frame as close to the original as possible, but this is only useful for slow-updating things, for example diagram sequences.

For short-interval, video-like GIF (probably the more common case), the same independent adaptiveness means every frame may have the same input color be somewhat different, which can look like continuous flickering.

You can instead calculate a palette based on all the input frames and use it for all frames. This means fewer colors and more change, but the transitions will be smooth. Of course, this only works for shortish sequences that don't change scene too much - the more colors, the worse it'll look. (Note that it does mean you can omit the local palette)
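As a rough illustration of one shared palette for all frames, here is a sketch using plain popularity counting (keep the most frequent colors, map everything else to the nearest kept color). That is a deliberately naive stand-in, not what real encoders do; they tend to use better quantizers such as median cut. Frames are assumed to be lists of (r, g, b) tuples, and all names are hypothetical.

```python
from collections import Counter

def shared_palette(frames, max_colors=256):
    """Pick the most common colors across all frames as one palette."""
    counts = Counter()
    for pixels in frames:
        counts.update(pixels)
    return [color for color, _ in counts.most_common(max_colors)]

def nearest_index(color, palette):
    """Index of the palette entry closest to color (squared RGB distance)."""
    return min(range(len(palette)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(color, palette[i])))

def quantize(frames, palette):
    """Map every pixel of every frame to an index in the shared palette."""
    return [[nearest_index(c, palette) for c in pixels] for pixels in frames]
```

Because every frame maps the same input color to the same palette entry, the frame-to-frame flicker described above goes away, at the cost of per-frame accuracy.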



WebP can be seen as

  • an improvement on JPEG (~25% smaller at similar perceptual quality)
  • now with some PNG-like features (~25% smaller when lossless)

Supports alpha channel.

Animation was added later (though support was not immediate), also making it an alternative over GIF (different quality/size choices).

Created by Google for the web, it's mainly read by browsers. Support in some of them (e.g. Firefox and Edge) came much later, and Safari still doesn't support it (though it seems to maybe plan it?[3]), so you always want to use it in a <picture> with alternatives, or with a JS polyfill, or whatnot.

A bunch of image editors also open it.

Related to VP8 and WebM


PBM, PNM, etc.


These files store uncompressed per-pixel values, in either ASCII or a simple binary form. They are very easy for programs to write and read, and arguably very CPU-efficient as there is no decompression to do. They are of course not space-efficient, and they contain almost no metadata.

They are probably most interesting as an intermediate pixel format, as there are converters to/from many formats, which means you can rely on external conversion instead of dozens of specific libraries. Seen used in quick-'n'-dirty experiments, in scripts, and such.

PNM is a collective name for:

  • PBM ('portable bit map'), two-color (1 bit per pixel)
  • PGM ('portable gray map') (typically 8 bits per pixel, 16-bit also seen in later versions(verify))
  • PPM ('portable pixel map'), color in the form of r,g,b triplets (typically 8 bits per channel, 16 also seen in later versions(verify))

They are often implicitly BT.709 coded, with gamma applied. Some programs allow specifying/assuming non-gamma'd form too.

The 16-bit variants imply some endianness decisions for the binary forms - which isn't strictly standardized, though MSB-first seems the usual choice.

PNM files start with two-byte magic, in ASCII

  • P1 - PBM, ASCII
  • P2 - PGM, ASCII
  • P3 - PPM, ASCII
  • P4 - PBM, binary
  • P5 - PGM, binary
  • P6 - PPM, binary
  • P7 - PAM, binary

PAM is a generalization of PNM, which adds depth and tuple type to the basic metadata, and only uses binary form.
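To show how little there is to these formats, here is a sketch that writes an ASCII PGM (P2) and a binary PPM (P6), and reads the latter back. Assumptions: 8-bit values only (maxval below 256), no comment lines in the header, and function names of my own choosing.

```python
def write_pgm_ascii(width, height, pixels, maxval=255):
    """Serialize row-major gray values as an ASCII PGM (P2)."""
    lines = ["P2", f"{width} {height}", str(maxval)]
    for y in range(height):
        row = pixels[y * width:(y + 1) * width]
        lines.append(" ".join(str(v) for v in row))
    return ("\n".join(lines) + "\n").encode("ascii")

def write_ppm_binary(width, height, rgb_pixels, maxval=255):
    """Serialize (r, g, b) tuples as a binary PPM (P6)."""
    header = f"P6\n{width} {height}\n{maxval}\n".encode("ascii")
    return header + bytes(v for pixel in rgb_pixels for v in pixel)

def read_ppm_binary(data: bytes):
    """Parse a comment-free binary PPM; returns (width, height, maxval, pixels)."""
    pos, fields = 0, []
    while len(fields) < 4:
        while data[pos:pos + 1].isspace():
            pos += 1
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        fields.append(data[start:pos])
    pos += 1  # exactly one whitespace byte separates the header from the raster
    if fields[0] != b"P6":
        raise ValueError("not a binary PPM")
    width, height, maxval = (int(f) for f in fields[1:])
    body = data[pos:pos + 3 * width * height]
    pixels = [tuple(body[i:i + 3]) for i in range(0, len(body), 3)]
    return width, height, maxval, pixels
```

The reader tokenizes the header by hand rather than splitting on whitespace, because the first raster bytes may themselves happen to be whitespace values.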

The netPBM package deals with:

  • PNM (see above), and
  • PAM, a more general version of pbm-style formats, which can store all PNM formats, and some more options.


DjVu is a free and open file format. It has different codecs for bitonal and gradient images, and applies well to high-resolution scanned text, manuscripts, magazines, line art, and such, for which it reaches noticeably higher compression than most general-purpose/photo image formats (and/or, at the same size, better legibility than various other formats).

It can optionally contain a text layer (often OCRed).

Layers and types

Document and document pages are layered/composited (see also ITU-T T.44), and each layer can be coded in a different way. In some cases, those layers may have come from different color channels from the same original image.

DjVu's coders are probably best at two-tone images, such as text. The separation between background layer(s), foreground layer(s), and masks helps - it allows e.g. highly compressible two-color coding for the text and a different choice for, say, the paper texture (for scanned text).

Images and documents

DjVu images are valid single-page DjVu documents.

Multi-page DjVu files are either Bundled (into a single file), or Indirect (where the main document is an index pointing to individual DjVu document files, which is handier when serving individual pages sparsely, a little less handy for sending things around).


Command examples

Viewing, converting
  • Browser plugins
  • Separate viewers WinDjView and MacDjView [4]
  • There are also various viewers for PDAs, phones and such


  • CGI on-the-fly conversion
  • If you have a to-PDF printer such as Adobe's, PrimoPDF, doPDF(?) or similar, you can create PDFs using any DjVu viewer (or anything else) that can print.

  • WMF (Windows MetaFile): Primarily calls into Windows's GDI[5] library
  • EMF: more commands than WMF
  • EMF+: calls into GDI+ (newer version of GDI)
  • .wmz is gzipped wmf
  • .emz is gzipped emf
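Since the .wmz and .emz variants are just gzip wrappers, getting at the plain metafile takes only the standard library. A small sketch, with a made-up function name; it checks for the gzip magic bytes (1F 8B) so that already-plain files pass through unchanged.

```python
import gzip

def unpack_metafile(data: bytes) -> bytes:
    """Return plain WMF/EMF bytes, gunzipping .wmz/.emz content if needed."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(data)
    return data
```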

Since they mostly call an API, they are basically executables and are easier to support on Windows than elsewhere. For the same reason, there have been security exploits of the API using these image files.

The format is not very common, though various programs can still export it, and it's seen in contexts like clipart libraries.

The ability to load/convert on non-windows:

  • libwmf allows rasterization, and has some command line conversions, like wmf2eps and wmf2svg
  • OpenOffice imports it
  • UniConvertor [6]
  • Inkscape? (depends on wmf2svg from libwmf?)


RAW photo formats



LDF: Bi-tonal format

WBMP (Wireless Bitmap) - monochrome, 1 bit per pixel data.

JPEG 2000