Video format notes


These are primarily notes. They won't be complete in any sense; they exist to contain fragments of useful information.


This page is mostly about storage of video, and variation therein. It also touches on some video capture.

For notes on encoding video, see the video encoding notes page.


Digital video (files, streaming)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

This is meant primarily as a technical overview of the codecs in common and/or current use (with some historical relations where they are interesting, or just easy to find), without too many details; there are just too many old and specialist codecs and details that are not interesting to most readers.


Note that some players hand off reading/parsing file formats to libraries, while others do it themselves.

For example, VLC does a lot of work itself, particularly using its own decoders. This puts it in control, allowing it to be more robust to somewhat broken files, and more CPU-efficient in some cases. At the same time, it won't play unusual files as well, since it doesn't perfectly imitate other common implementations, and it won't be as quick to pick up codecs it doesn't know about; in those cases, players that hand off the work to other components (such as mplayerc) will work better.


Container formats

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Containers are file types that can hold multiple streams, of various types, each using various codecs.

Some relatively general-purpose container formats include:


AVI (Audio Video Interleave)

A RIFF derivative (see also IFF), and quite common for a long while, though that shifted in part because it was not ideal for MPEG-4 video tracks, VBR MP3 audio tracks, and some other things - this older format does not really allow these without hacks that may be convention but are not standard. Many AVIs in the wild violate the AVI standard - but play fine on most (computer) players.


Derived:

  • Files with the .divx extension are usually AVIs (...containing DivX video)
  • Google Video (.gvi) files use MPEG-4 ASP and MP3 in a mild variant on AVI container [1] (and do not really exist anymore)

MKV (Matroska Video)

An open standard, preferred by some, as it is a fairly well-designed many-stream format, and because it allows subtitle embedding, meaning you avoid hassle related to external subtitle files.


Ogg

Ogg is a container format - see also Ogg notes - and an open standard.


Extension is usually .ogg, or .ogm, though .ogv, .oga, and .ogx are also seen.

Note that initially, ogg often implied Ogg Vorbis: Ogg containers containing Vorbis audio data.



Ogg Media (.ogm) is an extension of Ogg, which supports subtitle tracks, audio tracks, and some other things that make it more practical than AVI, and put it alongside things like Matroska.

Ogg Media is not really necessary and will probably not be developed further, in favour of letting Matroska become a wider, more useful container format.(verify)

Proprietary/minor/other

A number of container formats support only a limited number of codecs (sometimes just one), particularly if they are proprietary and/or specific-purpose.

Such container formats include:

  • Flash video (.flv) [2]
  • NUT (.nut), a competitor to avi/ogg/matroska [3]
  • Quicktime files (.mov) are containers, though without extensions to quicktime, they support relatively few codecs. In recent versions, MPEG-4 was added.
  • ASF (Advanced Systems Format), a proprietary format from Microsoft, most commonly storing wma and wmv content; it sees little other use in practice (partly because of patents and active legal protection). [4]
  • RealMedia (.rm)
  • DivX Media Format (.dmf)


Fairly specific-purpose:

  • Digital Picture Exchange (.dpx) [6]
  • Material Exchange Format (.mxf) [7]
  • Smacker (.smk), used in some video games [8]
  • Bink (.bik), used in some video games [9]
  • ratDVD

DVD-Video

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

tl;dr: data in MPEG-2 PS, some restrictions, and some DVD-specific metadata/layout around it.


A VIDEO_TS directory with VOB, IFO, and BUP files is, in a fashion, a container format, as it is the DVD-Video way of laying out:

  • metadata about stream data (chapters, languages of tracks, angles, etc.)
  • Video streams (usually MPEG-2, sometimes MPEG-1)
  • Audio streams (AC-3, MPEG-1 Layer II (MP2), PCM, or DTS)
  • Subtitle streams (bitmap images)

(note: The AUDIO_TS directory is used by DVD-Audio discs, which are fairly rare. On DVD-Video discs, this directory is empty, and the audio you hear is one of the streams in the VOBs.)


IFO stores metadata for the streams inside the VOB files (e.g. chapters; subtitles and audio tracks). BUP files are simply an exact backup copy of the IFO files (to have a fallback for a scratched DVD).


VOB files are containers based on MPEG-2 PS, and store the audio, video, and image tracks.

VOB files are segmented into files no larger than 1GB, a design decision meant to avoid problems with filesystems' file size limits (since the size of a DVD was larger than many filesystems at the time could deal with).


DVD players are basic computers in that they run a virtual machine. DVD-Video discs with menus run bytecode on that, although most such code is pretty trivial if you consider the potential flexibility of the VM -- there are a few DVD games, playable by any DVD player.


See also:

Stream identifiers (FourCCs and others)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

When container formats can store more than one video codec, they want to be able to indicate the format (codec) used in each stream.

For example:

  • AVI uses FourCCs: a sequence of four bytes, usually four printable ASCII characters, also used by a few other formats (a small example of reading these from an AVI file follows after this list)
  • MPEG containers mostly just contain MPEG video (...but there are a bunch of details to that)
  • Matroska (mkv) uses another system, CodecID, a flexible-length string.
  • Ogg doesn't have an identifier system, instead asking all available codecs whether they can play the data given to them (initially just the first frame from a stream).
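
As a small illustration of FourCCs in practice, here is a sketch (Python, assuming a well-formed AVI; minimal error handling) that walks an AVI's RIFF chunks and prints the fccType/fccHandler of each 'strh' stream header chunk - for the video stream, the handler is typically the codec FourCC (e.g. XVID, DX50).

import struct, sys

def avi_stream_fourccs(path):
    """Walk an AVI's RIFF chunks and yield (fccType, fccHandler) for each 'strh' chunk.
    Illustration only: assumes a well-formed file, does minimal error checking."""
    with open(path, 'rb') as f:
        riff, size, form = struct.unpack('<4sI4s', f.read(12))
        if riff != b'RIFF' or form != b'AVI ':
            raise ValueError('not a RIFF/AVI file')

        def walk(end):
            while f.tell() < end:
                header = f.read(8)
                if len(header) < 8:
                    return
                ckid, cksize = struct.unpack('<4sI', header)
                start = f.tell()
                if ckid == b'LIST':
                    f.read(4)                     # list type (e.g. 'hdrl', 'strl'); descend into it
                    yield from walk(start + cksize)
                elif ckid == b'strh':
                    fcc_type, fcc_handler = struct.unpack('<4s4s', f.read(8))
                    yield fcc_type, fcc_handler
                f.seek(start + cksize + (cksize & 1))   # chunks are word-aligned

        yield from walk(8 + size)

if __name__ == '__main__':
    for fcc_type, fcc_handler in avi_stream_fourccs(sys.argv[1]):
        print(fcc_type.decode('ascii', 'replace'), fcc_handler.decode('ascii', 'replace'))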


Video codecs

Earlier formats

  • Various RLE-like formats, used primarily for very simple animations


  • Flic (.fli, .flc), primarily video-only files used in Autodesk Animator
http://wiki.multimedia.cx/index.php?title=Flic_Video
https://en.wikipedia.org/wiki/FLIC_(file_format)


  • Cinepak
http://wiki.multimedia.cx/index.php?title=Cinepak


  • Intel Indeo:
    • Indeo 2 (FourCC: RT21) [10]
    • Indeo 3 (FourCC: IV31 for 3.1, IV32 for 3.2) [11]
    • Indeo 4 (FourCC: IV40, also IV41 for 4.1) [12]
    • Indeo 5.0 (FourCC: IV50) [13]


  • MJPEG is mostly just a sequence of JPEG images (FourCC: AVDJ, AVID, AVRn, dmb1, MJPG, mjpa, mjpb). [14] [15]
There are also some variations on this theme


Early H.26x family (related to MPEG and ITU standards. H.something is the ITU name):

  • H.261, a format made for videoconferencing over ISDN. Came before the more widely used H.263 [16]
  • H.262, which is identical to part of the MPEG-2 standard
  • H.263: for videoconferencing (seen used in H.323).
    • See also [17]
    • Also the base of various other codecs, including:
      • VIVO 1.0, 2.0, I263 and other h263(+) variants
      • Early RealVideo
      • Sorenson (including early Flash video)
        • Sorenson 1 (SVQ1, svq1, svqi)
        • Sorenson Spark (Used in Flash 6, 7 and later for video)
        • (Sorenson 3 (SVQ3) was apparently based on a H.264 draft instead)


H.261, H.262, and H.263 haven't really been relevant since the late nineties / early noughties, due to better things (for both realtime and non-realtime cases) being available, such as MPEG-4, though the transition was more gradual than that. Consider e.g. Flash, RealVideo, WMV:

Nineties to noughties and later

  • MPEG-4 part 2, a.k.a. MPEG-4 ASP
    • DivX, XviD, and many versions, variants, and derivatives
    • FourCC: [18] mentions 3IV2, 3iv2, BLZ0, DIGI, DIV1, div1, DIVX, divx, DX50, dx50, DXGM, EM4A, EPHV, FMP4, fmp4, FVFW, HDX4, hdx4, M4CC, M4S2, m4s2, MP4S, mp4s, MP4V, mp4v, MVXM, RMP4, SEDG, SMP4, UMP4, WV1F, XVID, XviD, xvid, XVIX
(See also MPEG4)


  • H.264, a.k.a. MPEG-4 AVC, MPEG-4 Part 10
    • FourCC depends on the encoder (not too settled?).
      • ffmpeg/mencoder: FMP4 (which it also uses for MPEG-4 ASP, i.e. DivX and such. It seems this is mostly meant to send these files to ffdshow(verify), but not all players understand that)
      • Apple: avc1
      • Various: H264, h264 (verify)
      • Some: x264 (verify)


  • On2 (Duck and TrueMotion also refer to the same company):
    • VP3 (FourCC: VP30, VP31, VP32): [19]. Roughly in the same class as MPEG-4 ASP. Open sourced.
    • VP4 (FourCC: VP40) [20]
    • VP5 (FourCC: VP50): [21] [22]
    • VP6 (FourCC: VP60, VP61, VP62): Used for some broadcasting [23] [24]
    • VP7 (FourCC: VP70, VP71, VP72): A competitor for MPEG-4 [25] [26]
    • Xiph's Theora codec is based on (and better than) On2's VP3 [27]


  • AV1
    • basically a successor to VP9, from the Alliance for Open Media
    • considerably more efficient than VP9 and H.264, somewhat more efficient than H.265 (HEVC)
    • open, royalty-free (like VP9 and Theora), which makes it less cumbersome to adopt than H.264 or H.265 license-wise


  • WebM
    • VP8 or VP9, plus Vorbis or Opus, in Matroska
    • started by Google after acquiring On2
    • supported by all modern browsers (like H.264)
    • open, also royalty-free (unlike some parts of MPEG-4)
    • quality is quite comparable to H.264


  • Dirac [28], a royalty-free codec from the BBC, apparently comparable to H.264(verify)


  • H.265, a.k.a. HEVC
    • see also MPEG notes


  • H.266, a.k.a. VVC



Containers that meant different things over time

  • RealVideo uses different names internally and publicly, some of which are confusable:
    • RealVideo (FourCC: RV10, RV13) (based on H.263)
    • RealVideo G2 (FourCC: RV20), used in version 6 (and 7?) (based on H.263)
    • RealVideo 3 (FourCC: RV30), used in version 8 (apparently based on a draft of H.264)
    • RealVideo 4 (FourCC: RV40, and also UNDF) is the internal name/number for the codec used in version 9. Version 10 is the same format, but the encoder is a little better.
    • The H.263-based versions (up to and including 7) were not very impressive, but versions 9 and 10 are quite decent. All are proprietary and generally only play on RealPlayer itself, unless you use something like Real Alternative.


  • Flash video [29] (preferred first, will-play list) (verify)
    • Flash 6: Sorenson Spark (based on H.263)
    • Flash 7: Sorenson Spark
    • Flash 8: VP6, Sorenson Spark [30]
    • Flash 9: H.264, VP6, Sorenson Spark (and understands MP4, M4V, M4A, 3GP and MOV containers)
    • Flash 10: (verify)


Microsoft:

  • Windows Media Video [31], often in .wmv files (which are asf containers)
    • version 7 (FourCC: WMV1) (based on MPEG-4 part 2)
    • version 8 (FourCC: WMV2)
    • version 9 (FourCC: WMV3)
  • RTVideo [32]
  • VC-1 [33]


Apple:

  • Quicktime [34]
    • 1: simple graphics, and RPZA video [35]
    • 5: Added Sorenson Video 3 (H.263 based)
    • 6: MPEG-2, MPEG-4 Part 2 support. Later versions also added Pixlet [36] [37]
    • 7: H.264/MPEG-4 AVC, better general MPEG-4 support
  • Internal formats like 'Intermediate Codec' [38] and ProRes [39]


Unsorted

  • Uncompressed Raw YUV [40]
  • Compressed YUV, e.g.
    • HuffYUV (lossless, and easily over 20GB/hour)
  • RawRGB (FourCC: 'raw ', sometimes 0x00000000) [41]
  • Hardware formats: (verify)
    • AVID
    • VCR2
    • ASV2


See also:

Pixel/color formats (and their relation to codecs)

Streaming, streaming support protocols

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

See Streaming audio and video

Subtitles

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Hardsubs is a jargon term for subtitles that are mastered to be directly part of the video. They have no special status; they simply mask the video underneath. This avoids all support issues, and usually looks good, but gives no choice of language, or of whether to display the subtitles at all.


Softsubs refer to separate subtitle data, historically often as a separate file with the same name and a different extension, and more recently as a part of container formats which support multiple streams (such as MKV), which can also store multiple different subtitles (e.g. languages) at once.


There are a number of formats, and not all file extensions are very obvious. Particularly things like .sub and .txt may be one of various formats.


Formats include:

  • Plain-text:
    • Various, but most importantly:
    • SRT (SubRip [42] [43]) - very simple (but not too well standardized, and some advanced features are not well handled by some players/editors); a small parsing sketch follows after this list
  • Subtitle editors' internal formats (text, binary, xml, other), some of which became more widely used:
    • SSA (SubStation Alpha, software of the same name) [44] [45]
    • ASS (Advanced Substation Alpha, aegisub) - an extension of SSA ([46])
    • AS5 (ASS version 5) [47]
    • JacoSub [48] [49]
    • XombieSub [50]
    • AQTitle(verify) [51]
    • Turbotitler (old) [52]
  • Image-based: (avoids font problems, larger)
    • VOBsub (.sub and .idx) are subtitle streams as used by DVD (verify)
    • MicroDVD (.sub), specific to the MicroDVD player [53]
    • PRS (Pre-rendered subtitles) stores (PNG) images [54]
  • XML-based
    • USF (Universal Subtitle Format) [55] [56], an XML format that is not very common outside of Matroska containers
    • SSF (Structured Subtitles Format) is a newer XML-based format (apparently with no current major interest or support [57])
  • Other/unsorted (and other internal formats):
    • VTT, a.k.a. WebVTT[58][59]
    • SAMI (.smi) [60], often used in Korea
    • DVD-based/derived (CC, SPU, VobSub)
    • Karaoke formats (.lrc, .vkt, )
    • MPSub (.sub), a format internal to mplayer [61]
    • MPEG-4 Timed Text [62]
    • Power DivX (.psb) [63]
    • ViPlay Subtitle File (.vsf)
    • Phoenix Japanimation Society (.pjs) [64] (old(verify))
    • Subsonic (.sub) [65]
    • ZeroG (.zeg) [66]
    • Adobe Encore (.txt) [67]
    • MPL2 [68]
    • VPlayer [69]
    • Sasami Script (.s2k)
    • SubViewer (verify)
    • RT (verify)
    • DVB (verify)
    • Teletext (verify)
    • LRC[70] (Lyrics, meant for lyrics/karaoke on audio)
    • TTML [71]
    • ARIB (Association of Radio Industries and Businesses)
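
For a sense of how simple SRT is: each cue is a counter line, a timing line like '00:02:17,440 --> 00:02:20,375', then text until a blank line. A minimal parsing sketch (Python; it ignores the format's looser corners such as BOMs, numbering quirks, and formatting tags):

import re

# SRT uses a comma before the milliseconds; some files use a period, so accept both.
TIME_RE = re.compile(r'(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)')

def parse_srt(text):
    """Yield (start_seconds, end_seconds, text) per cue. Sketch only."""
    for block in re.split(r'\n\s*\n', text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        timing_line = 1 if TIME_RE.search(lines[1]) else 0
        m = TIME_RE.search(lines[timing_line])
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        yield start, end, '\n'.join(lines[timing_line + 1:])

example = """1
00:02:17,440 --> 00:02:20,375
Senator, we're making
our final approach."""

for cue in parse_srt(example):
    print(cue)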

Editors and other utilities:


Player support

MPEG notes

Overview

MPEG refers to such a large sprawl of things that it's hard to have an overview of even just the interesting bits.

It refers to a wide set of standards, which includes various media containers, media codecs for audio, video, and more.


MPEG can refer to one of three sets of standards: MPEG-1, MPEG-2, and MPEG-4 (MPEG-3, originally aimed at HDTV, was folded into MPEG-2; the name also risked confusion with MP3, which (typically) refers to MPEG-1 Audio Layer III). These are formats that can store video and/or audio streams, and a little more.


Note that

  • MPEG-1 is ISO 11172 (the important bits mainly from 1993)
  • MPEG-2 is ISO 13818 (mainly from 1995..1997)
  • MPEG-4 is ISO 14496 (since 1999, parts being updated and added more steadily)
  • it's often shorter and easier to refer to, say, "MPEG-4 Part 3" rather than "ISO/IEC 14496-3"
  • people probably use "MPEG-4" largely to refer to the standard as a whole, the container format, and the ASP and AVC video codecs


  • There's a lot of parallel and adopted standardization going on, e.g.
    • it's all ISO/IEC since MPEG-2 (fairly common for technical standards)
    • some parts are also basically identical to ITU standards (also common)
    • parts adopted from elsewhere (e.g. TwinVQ from NTT(verify)), sometimes in tight cooperation, sometimes more for the sake of standardization (verify)
  • There are updates over many years
    • some more structural, like how the MP4 file format (Part 14, from 2003) revises/extends and replaces the earlier definition in Part 12, and updates one bit of Part 1 (verify)


Parts

(More widely known / used / interesting parts bolded)

MPEG-1 parts:

                                   last   
                         since    change    
Part 1   ISO 11172-1      1993     1999     Systems     (basically refers to the container structure and 
                                                               details like syncing and multiplexing)
Part 2   ISO 11172-2      1993     2006     Video, basically H.261
Part 3   ISO 11172-3      1993     1996     Audio             including what we know as MP3
Part 4   ISO 11172-4      1995     2007     Compliance testing     
Part 5   ISO 11172-5      1998     2007     Software simulation 


MPEG-2 parts

Part 1    ISO 13818-1     1996     2016     Systems, also H.222.0          
Part 2    ISO 13818-2     1996     2013     Video, basically H.262 (very similar to H.261 / MPEG-1 Part 2, 
                                                                          adds details like interlacing)
Part 3    ISO 13818-3     1995     1998     Audio, much like MPEG-1 Part 3, e.g. extending channels but in a 
                                                         backwards compatible way (hence MPEG-2 BC)
Part 4    ISO 13818-4     1998     2009     Conformance testing     
Part 5    ISO 13818-5     1997     2005     Software simulation     
Part 6    ISO 13818-6     1998     2001     Extensions for Digital Storage Media Command and Control (DSM-CC)
Part 7    ISO 13818-7     1997     2007     Advanced Audio Coding (AAC) (a.k.a. MPEG-2 NBC Audio, 
                                               non-backwards-compatible, to contrast with MP3/Part 3)
Part 9    ISO 13818-9     1996              Extension for real time interface for systems decoders     
Part 10   ISO 13818-10    1999              Conformance extensions for DSM-CC
Part 11   ISO 13818-11    2004              Intellectual Property Management and Protection (IPMP) 
                                               on the MPEG-2 system

Part 8 was 10-bit video but was never finished because of little interest.


MPEG-4 parts

Part 1    ISO 14496-1     1999     2014   Systems   including the MPEG-4 file format 
Part 2    ISO 14496-2     1999     2009   Visual    including Advanced Simple Profile (ASP), better known as DivX
Part 3    ISO 14496-3     1999     2017   Audio     including AAC, ALS (lossless), SLS, 
                                                    Structured Audio (low bitrate), HVXC and CELP (speech)
Part 4    ISO 14496-4     2000     2016   Conformance testing    
Part 5    ISO 14496-5     2000     2017   Reference software     
Part 6    ISO 14496-6     1999     2000   Delivery Multimedia Integration Framework (DMIF),
                                            basically an API that abstracts out network transfers
Part 7    ISO 14496-7     2002     2004   Optimized reference software for coding of audio-visual objects
Part 8    ISO 14496-8     2004            MPEG-4 content over IP networks, think RTP, SDP transport, some guidelines
Part 9    ISO 14496-9     2004     2009   Reference hardware description
Part 10   ISO 14496-10    2003     2016   Advanced Video Coding (AVC), a.k.a. ITU-T H.264
Part 11   ISO 14496-11    2005     2015   Scene description and (Java) application engine  (updates parts of Part 1)
Part 12   ISO 14496-12    2004     2017   ISO base media file format   
                                            (largely the same as ISO 15444-12, JPEG 2000's base format).
Part 13   ISO 14496-13    2004     2004   Intellectual Property Management and Protection (IPMP) Extensions     
Part 14   ISO 14496-14    2003     2010   MP4 file format, a.k.a. MPEG-4 file format version 2. 
                                            (Based on Part 12 and updates (clause 13 of) Part 1)
Part 15   ISO 14496-15    2004     2020   details of carrying Part 10 videos in a Part 12 style container (verify)
Part 16   ISO 14496-16    2004     2016   Animation Framework eXtension (AFX), describing 3D content
Part 17   ISO 14496-17    2006            Streamed, timed subtitle text
Part 18   ISO 14496-18    2004     2014   Font compression and streaming (for Part 22 fonts)
Part 19   ISO 14496-19    2004            Synthesized texture stream, for very low bitrate synthetic video clips.
Part 20   ISO 14496-20    2006     2010   basically a variant of Scalable Vector Graphics (SVG)
Part 21   ISO 14496-21    2006            a Java API for building multimedia apps
Part 22   ISO 14496-22    2007     2017   Open Font Format  (basically OpenType 1.4)
Part 23   ISO 14496-23    2008            Symbolic Music Representation (SMR)
Part 24   ISO 14496-24    2008            Audio and systems interaction 
                                            (details to putting MPEG-4 Audio in the MPEG-4 file format)
Part 25   ISO 14496-25    2009     2011   3D Graphics Compression 
Part 26   ISO 14496-26    2010     2016   Audio Conformance     
Part 27   ISO 14496-27    2009     2015   3D Graphics conformance     
Part 28   ISO 14496-28    2012            Composite font representation     
Part 29   ISO 14496-29    2014     2015   basically restricted profiles of Part 10 (H.264) video
Part 30   ISO 14496-30    2014            more subtitle related stuff (timed text and other visual overlays)
Part 31   ISO 14496-31    not finished    Video Coding for Browsers (VCB)
Part 32   ISO 14496-32    not finished    Conformance and reference software     
Part 33   ISO 14496-33    not finished    Internet video coding

A bit more real-world

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

"okay, beyond the standards, what parts of MPEG do I mostly care about?"


You might focus on the more popular codecs, containers, and products, e.g.

  • MP3 (MPEG-1, MPEG-2)
  • AAC (defined in MPEG-4 Part 3, among lesser-known things)
  • MPEG-4 ASP (better known as DivX, XviD)
  • MPEG-4 AVC (better known as H.264)
  • The MPEG-4 container format, which involves multiple parts, in that
    • Part 1 is effectively v1
    • Part 14 updates Part 1 and builds on Part 12 (retroactively versioning Part 1 as v1 and Part 14 as v2), and is almost identical to the QuickTime File Format
    • Part 12 is the shared ISO base media file format that Part 14 builds on


Physical disc formats often follow a restricted version of a specific MPEG variant. Consider:

  • based on MPEG-1: VCD, XVCD, and KVCD
  • based on MPEG-2: DVD, SVCD, and KVCD
    • DVDs follow the MPEG-2 Program Stream format(verify), with specific constraints on the codec and (effectively) on encoder settings
  • Blu-ray is mostly MPEG-4 AVC (and a few MPEG-2)
  • EVO (EVOB) [72] on HD-DVD allows [73] MPEG-2 video or MPEG-4 AVC


You can often get away with not caring about lesser-used / niche things, like

  • TwinVQ
  • ALS
  • various whole Parts


Note that

  • As a name, MPEG-2 can be vague, in that it can refer to
    • the whole standard (ISO 13818),
    • or specifically to MPEG-2's video compression (MPEG-2 Part 2 a.k.a. H.262 -- itself basically an extension of MPEG-1 Part 2).
  • As a name, "MPEG-4 video" can be vague, in that it can refer to
    • MPEG-4 Part 2, which includes MPEG-4 ASP, better known as DivX, XviD, and some others
    • MPEG-4 Part 10, MPEG-4 AVC, which is instead known as H.264 (its ITU name) to lessen confusion.


Also relevant:

  • H.265 (HEVC), which can be seen as a further development of MPEG-4 AVC / H.264 that gives similar quality at roughly half the bitrate
    • it can be carried in MPEG TS, MPEG-4 containers (and others)
  • ...and once we're on such comparisons, also VC-1, VP8 and VP9, and such





See also:

MPEG audio codecs

MPEG-1(/2/2.5) Audio Layers

An MP3 file is a bare audio elementary stream (not wrapped in PS or TS), and without a file-level header (though I've seen a few ADTS streams with an .mp3 extension).

So detecting something as an MP3 becomes more of an "if you can find a few valid, consistent frames back to back" thing.
https://en.wikipedia.org/wiki/MPEG_elementary_stream#General_layout_of_MPEG-1_audio_elementary_stream
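
A sketch of that detect-a-few-consistent-frames idea (Python; bitrate/samplerate tables abbreviated to MPEG-1 Layer III only, and free-format streams are ignored):

# Sketch: walk MPEG-1 Layer III frame headers back to back.
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLERATES = [44100, 48000, 32000]

def frame_length(header):
    """Return the frame length in bytes, or None if these 4 bytes are not a valid MPEG-1 Layer III header."""
    if header[0] != 0xFF or (header[1] & 0xFE) != 0xFA:    # 11 sync bits, version = MPEG-1, layer = III
        return None
    bitrate_index = header[2] >> 4
    sr_index = (header[2] >> 2) & 0x3
    padding = (header[2] >> 1) & 0x1
    if bitrate_index in (0, 15) or sr_index == 3:          # free-format / invalid / reserved
        return None
    bitrate = BITRATES_KBPS[bitrate_index] * 1000
    return 144 * bitrate // SAMPLERATES[sr_index] + padding

def looks_like_mp3(data, needed=4):
    """True if `needed` consistent frames appear back to back somewhere early in the data."""
    for start in range(min(len(data), 65536)):
        pos, found = start, 0
        while pos + 4 <= len(data) and found < needed:
            length = frame_length(data[pos:pos + 4])
            if length is None:
                break
            pos += length
            found += 1
        if found >= needed:
            return True
    return False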

The audio stream itself is the one mainly defined in MPEG-1 Part 3. For music we almost exclusively use Layer III (which MP3 is named for), mostly because it compresses more efficiently than Layers I and II.

MPEG-2 (BC) audio added a few lower bitrates and sample rates, making it slightly more flexible, but also only rarely useful (e.g. not for music).

It is not 100% backwards compatible, but the change to decoders is so small that to most players, MPEG-1 audio and MPEG-2 audio streams are almost identical in practice.

There is also unofficial "MPEG 2.5", which extends the options on the lower bitrate end. This is another step in 'probably supported, but not technically backwards compatible'. It is also not in the standard.

MPEG-2 (BC) audio also adds 5.1 channels, but is designed around the MPEG-1 core to the point that an MPEG-1 decoder will play the stereo channels and ignore the rest(verify)



AAC

Effectively has a few versions:

  • MPEG-2 Part 7 defines three profiles
    • AAC-LC / LC-AAC (Low Complexity)
    • AAC Main
    • AAC-SSR (Scalable Sampling Rate) [74]
  • MPEG-4 Part 3 adopts and extends MPEG-2's AAC
    • minor changes and more profiles
    • AAC-LD (Low Delay)
    • later added HE-AAC (sometimes aacPlus), a higher-efficiency variant
    • later yet HE-AAC v2, which added parametric stereo

See also MPEG-4 Part 3 Audio Profiles

These were essentially all expansions of the previous, so you can have one decoder for all.


MPEG-4 DST

Direct Stream Transfer: lossless coding of DSD audio, as used on Super Audio CD(verify)


MPEG-4 HVXC

Speech
https://en.wikipedia.org/wiki/Harmonic_Vector_Excitation_Coding


MPEG-4 ALS

Audio Lossless Coding
https://en.wikipedia.org/wiki/Audio_Lossless_Coding


MPEG-4 SLS

https://en.wikipedia.org/wiki/MPEG-4_SLS


MPEG-4 Structured Audio

https://en.wikipedia.org/wiki/MPEG-4_Structured_Audio



Transports


ADTS (Audio Data Transport Stream) is a simple framing for AAC (defined in MPEG-2 Part 7) in which each AAC frame is preceded by a small ADTS header, so a stream can be (re)synced at any frame.


.aac files are often either raw AAC, or AAC in ADTS


LOAS (Low Overhead Audio Stream) / LATM are simplified MPEG-4 transports for just audio.

It frequently carries AAC, but can also carry other MPEG-4 codecs.



(Carrying other things, like AC-3?)

MPEG video codecs

MPEG-4 ASP (e.g. DivX) was a good jump forward at the time; a few years later, MPEG-4 AVC (a.k.a. H.264) was another.


MPEG-2 sounds old, and at lower bitrates indeed looks much worse than AVC (or ASP), so when pressed for space, or playing video online (higher transfer costs per quality), you would always choose a more space-efficient codec - including some newer than AVC when support is good enough. (Which, because of stupid political games, is easily summarized as "not".)


At high enough bitrates, most any video codec becomes transparent.

And it matters that more refined encoders did a lot better years later, while still adhering to the same bitstream format.(verify)


As such, when space is not an issue, MPEG-2 is still perfectly serviceable, and it can matter that MPEG-2 may encode and decode faster because it's simpler.

But that only applies to a few uses, like streaming where CPU cost is higher than transmission cost (which is now basically never, though there are still a few potential uses, like "in my own event streaming setup").

The case I'm getting at is Blu-ray. DVDs and Blu-rays were basically both sized to fit a high-bitrate MPEG-2 movie (of different resolutions, because different eras)(verify).

Given that Blu-rays are large and a disc is for a single movie, you can throw enough bitrate at it that MPEG-2 is going to be just as transparent as AVC or VC-1 (and any differences are more down to codec-specific artifacts, or probably more to things like film transfers).

Which is why it actually makes a lot of sense to support MPEG-2 in Blu-Ray discs.

At the same time, while both may be equally fine for video, most Blu-Rays are AVC[75], probably for more practical reasons.

Container format notes

MPEG-1, 2, and 4 (MPEG in general) define a number of stream types, which work out to be fairly general-purpose.

There are restricted forms in various places, often to guarantee playability on hardware players.


MPEG 1 and 2

Elementary Streams, Transport Streams, Program Streams:

An (MPEG) Elementary Stream is a single type of data, e.g. audio, or video.

They will typically be the output of a single audio or video encoder. They are streams in that any one stream can be handed to the corresponding decoder.

It may or may not have a header explaining exactly what format the data is in.

Neither encoder nor decoder needs to think about how this is stored or multiplexed or such - that's up to the container.


A Packetized Elementary Stream (PES) is such an elementary stream split into smallish packets, largely so that you can guarantee you can go through it using a smallish buffer.

The structure of an elementary stream depends on the codec being used, and even the presence of a header at the start is effectively optional, yet common for practical reasons.

For example:

  • An MPEG-2 video elementary stream will start with a header
    • e.g. 00 00 01 B3 (a tiny check for this is sketched right after this list)
    • https://en.wikipedia.org/wiki/MPEG_elementary_stream#Header_for_MPEG-2_video_elementary_stream
  • An MP3 file is a single audio elementary stream (not wrapped in PS or TS), and without even a header
    • It isn't strictly a PES either(verify), but if your goal is using a small buffer, the way it splits up into frames amounts to roughly the same thing
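
For instance, a quick check for that sequence header start code could look like this (a sketch; real format probing looks at more than this one start code):

def looks_like_mpeg_video_es(data):
    # MPEG-1/2 video elementary streams start with a sequence header start code: 00 00 01 B3
    return data[:4] == b'\x00\x00\x01\xb3'

def find_sequence_headers(data):
    # Sequence headers also recur inside the stream (typically once per GOP), so they can be scanned for.
    pos, hits = 0, []
    while True:
        pos = data.find(b'\x00\x00\x01\xb3', pos)
        if pos < 0:
            return hits
        hits.append(pos)
        pos += 4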


When playing back and/or transmitting streams (video, audio, captions, other), you often need to play them back synchronised.

MPEG-2 makes a distinction between

  • Transport Streams (TS)
    • fixed bitrate, which makes them easier to buffer, easier to resync in less-reliable contexts, and puts more restrained demands on the decoder, which is nice for hardware playback
    • at the cost of some coding efficiency
    • has some more error correction and synchronization patterns?(verify)
  • Program Streams (PS)
    • variable bitrate, can be more space-efficient
    • at the cost of buffering being a little more complex, less resistance to unreliable media, and potentially higher demands on the decoder


Things like DVD, and most MPEG-2 video files you'll find, will be PS.

You'll find TS around broadcast, e.g. used in DVB.


Both can carry multiple PES streams, but in PS they are part of the same program, and TS can carry multiple programs (is this used, though?)(verify)


Syncing is a topic of its own.

In video production (e.g. studios) you would often use genlocking, in which the video signal itself is used to sync another.

In transmission this is less practical and less flexible, and you often set up a clock source by description, and sync separate streams to that via their timestamps. This is what a Program Clock Reference (PCR) is about.

In a Single Program Transport Stream (SPTS) there is a single PCR channel.

In general, different programs may have to be synchronized differently.



In PS, the timebase can be known ahead of time, things can be fetched based on that timebase, and streams are implicitly on the same clock.

In TS, each program's clock is instead reconstructed from the PCR values carried in the stream.(verify)


While a distinction could be made between MPEG-1 PS and MPEG-2 PS (defined in MPEG-1 Part 1 and MPEG-2 Part 1, respectively), they seem to be so similar that a lot of code reads both(verify) (though what they contain varies).

Transport Streams seem to have been introduced in MPEG-2(verify) defined as part of MPEG-2 Part 1, so TS implicitly refers to MPEG-2.

Also, the distinction seems to be mostly an MPEG-2 thing, with MPEG-4 presumably being more fine grained?(verify)


MPEG-2 Transport Stream (TS) Transport Packets (TP) have a fixed size of 188 bytes, so you can be fairly sure you didn't lose sync if you skip 188 bytes and see another sync byte (0x47).
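
That check is easy to express; a sketch (assuming a plain 188-byte TS, not the 192- or 204-byte variants that add per-packet timestamp/FEC bytes):

PACKET_SIZE = 188
SYNC_BYTE = 0x47

def find_ts_sync(data, packets_to_check=5):
    """Return the offset of the first plausible TS packet, or None.
    'Plausible' here means a 0x47 sync byte repeats every 188 bytes for a few packets in a row."""
    for offset in range(min(len(data), PACKET_SIZE)):
        if all(
            offset + i * PACKET_SIZE < len(data)
            and data[offset + i * PACKET_SIZE] == SYNC_BYTE
            for i in range(packets_to_check)
        ):
            return offset
    return None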

In a Single Program Transport Stream (SPTS), there will be one Program Clock Reference (PCR) channel that recreates one program clock for both audio and video.


MPEG-2 PS



In MP3 you are typically dealing with MPEG-1, sometimes MPEG-2; frames have sync bits and a header that let you be pretty sure this one is correct, and that the thing that follows is also a frame.

http://www.img.lx.it.pt/~fp/cav/Additional_material/MPEG2_overview.pdf

MPEG-4 container

Uses a TLV-style affair, with

  • a uint32 size (including the size and type, and contained boxes)
  • a uint32 type (often readable ASCII)
  • fields (chunksize-8 bytes of them)
  • possibly contained boxes

MPEG-4 calls them Boxes (previously atoms), which helps visualize how they can be nested into a tree structure. (A small parsing sketch follows below, after the ftyp notes.)


ftyp

  • mandatory (but if missing, should be parsed as if there is an ftyp with the mp41 brand)
  • major brand - uint32 (usually printable text) that lists the type of content
  • minor version - uint32, minor type to the brand
  • compatible brands -
  • An MPEG-4 stream puts an ftyp box as early as possible, and in a file it is typically at the start, which is also useful as file magic (look for 'ftyp' at index 4; the size of this box varies a little due to the compatible-brands list). The four bytes after ftyp help identify the more specific kind of file it is. For example:
    • isom: MP4 Base Media v1 (Part 12)
    • mp71: MP4 with MPEG-7 metadata
    • mp41: MP4 v1 (Part 1)
    • mp42: MP4 v2 (Part 14)
    • qt  : Apple QuickTime
    • mmp4: 3GPP ('mp' referring to Mobile Profile) (there are a handful more 3GPP variants)
    • There are various more, see also ftyps.com


As an indication of nesting, a relatively minimal video file's first two levels may look something like

  • ftyp
  • moov
    • trak (details about the video track)
    • trak (details about the audio track)
  • mdat (no boxes; contents are referred to via sample tables under trak)


https://standards.iso.org/ittf/PubliclyAvailableStandards/c068960_ISO_IEC_14496-12_2015.zip

3GPP

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

3GPP (3rd Generation Partnership Project) is a wide term, grouping various mobile-related development.

In this context we perhaps care most about the MPEG-4 style container as used in mobile context (often referred to as 3GP and 3G2), which basically takes the MPEG-4 container format and removes some features you don't need there (making it easier to implement), and adds some things useful in this context.

https://en.wikipedia.org/wiki/3GP_and_3G2

MPEG streams can contain...

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

In practice it's more relevant what a stream can contain, and where you can use it (than how it's structured).

An MPEG-1 Program Stream can contain (presumably(verify))

  • MPEG-1 Part 2 video
  • MPEG-1 Part 3 audio

An MPEG-2 Program Stream can contain

  • MPEG-1 Part 2 video
  • MPEG-2 Part 2 video
  • MPEG-1 Part 3 audio
  • MPEG-2 Part 3 audio

And in theory (but rarely in practice)

  • MPEG-4 Part 2 video (also in TS)
  • MPEG-2 Part 7 audio (AAC)
  • MPEG-4 Part 3 audio (AAC)


MPEG-4 can contain

  • As per the ISO 14496 standard:
    • video: ASP (MPEG-4 Part 2)
    • video: AVC a.k.a. H.264 (MPEG-4 Part 10)
    • audio: MP3 (MPEG-4 Part 3) (verify)
    • audio: AAC (MPEG-4 Part 3)
    • audio: ALS (lossless) (MPEG-4 Part 3)
    • audio: SLS (MPEG-4 Part 3)
    • audio: Structured Audio (MPEG-4 Part 3) (low bitrate)
    • audio: HVXC (MPEG-4 Part 3) (speech codec)
    • audio: CELP (MPEG-4 Part 3) (speech codec)
    • audio: TwinVQ (MPEG-4 Part 3)


  • Other standards, or proprietary, or haven't figured out yet
    • 3GPP (.3gp) and 3GPP2 (.3g2) are restricted versions of the MPEG-4 container and/or contents, made to be more easily supported by mobile devices, and can contain
      • video: H.263
      • video: VP8
      • audio: AMR-NB
      • audio: AMR-WB and WB+
      • audio: AAC (AAC-LC, HE-AAC v1, HE-AAC v2)
      • audio: MP3
    • video: VC-1 (mostly just in Blu-Ray)
    • video: HEVC a.k.a. H.265

Other transports

Online streaming is frequently MPEG-DASH[76] ('Dynamic Adaptive Streaming over HTTP'). (Meant as a standard that is less proprietary than Smooth Streaming (Microsoft), HDS (Adobe), or most others.)


DASH breaks content into short segments, where each segment (the minimum download unit, usually a few seconds) can be served at different bitrates.

A manifest is transferred to tell the player where to find segments for each quality.

The client can choose the bitrate it thinks works best (balance between required speed and detail), and can typically switch seamlessly during playback.
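
The 'client chooses' part is just heuristics inside the player; a toy sketch of that decision (the rendition ladder and the safety margin are made-up numbers for illustration, not anything from the DASH spec):

# Toy adaptive-bitrate choice: pick the highest rendition that comfortably fits measured throughput.
RENDITIONS_KBPS = [235, 750, 1750, 3000, 5800]   # hypothetical ladder, as a manifest might list

def pick_rendition(measured_throughput_kbps, safety=0.8):
    affordable = [r for r in RENDITIONS_KBPS if r <= measured_throughput_kbps * safety]
    return max(affordable) if affordable else min(RENDITIONS_KBPS)

# e.g. after downloading a segment of segment_size_bits in t seconds:
#   measured_throughput_kbps = segment_size_bits / t / 1000
print(pick_rendition(4200))   # -> 3000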


You can have DASH in downloaded form, which will presumably be a sequence of moof,mdat fragments (rather than a typical MP4, which can be mostly just one big moov,mdat pair) (presumably it's also more specifically following 3GPP)



https://en.wikipedia.org/wiki/Dynamic_Adaptive_Streaming_over_HTTP

Frame rate, analog TV format, and related

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

(I'm not sure about all this - there is so much fuzzy and/or conflicting information out there)


Frame rates

Movies on media like DVD come in different frame rates. This does not matter to computer playability, so unless you are compressing or converting video, you probably want to ignore this completely.

Common rates

Some of the more common rates seem to be:

  • 24 (exactly) - used to shoot most film, and used in most cinema projection. Suggests a film source.
  • 24000/1001 (approx. 23.976 / 23.97) - usually an intermediate in conversion from film to NTSC color. Also called 'NTSC film'.
  • 25 (exactly) - the speed of rasters transferred (not shown frames) in broadcasts such as PAL (except PAL M) and SECAM. Also called 'PAL video'.
  • 30000/1001 (approx. 29.97) - the speed of rasters transferred (not shown frames) in (interlaced) broadcasts such as NTSC M (the most common NTSC form) and also PAL M. Pre-1953 NTSC broadcasts were exactly 30.0fps. Also called 'NTSC video'.
  • 30 (exactly) - apparently the black-and-white variant of NTSC was exactly 30, and 30000/1001 was the hack upon that (verify). Exactly-30fps content is relatively rare(verify), because it's either pre-1953 NTSC TV, or modern digital things that just chose this(verify).
  • 50 (exactly) - can refer to 50 frames per second progressive, or to 25 frames per second interlaced being played (and possibly deinterlaced) as its 50 contained fields per second (as e.g. in PAL and SECAM TV, except PAL M). Also called 'PAL film', 'PAL video', 'PAL field rate'.
  • 60000/1001 (approx. 59.94)(verify) - the field rate of NTSC color; can refer to NTSC color TV that is transferring interlaced rasters. Also called 'NTSC field rate'.



These are the most common, but other rates than these exist. For example, there is double rate and quad rate NTSC and PAL (~60fps, ~120fps; 50fps, 100fps), often used for editing, or e.g. as intermediates when converting interlaced material.


A framerate hints at the source of the video (24 is typically film, ~25 is often PAL broadcast, 30000/1001 is typically NTSC broadcast) and/or the way it is played (e.g. 50 and 60000/1001 usually means analog TV content, and possibly interlaced). There's a bunch of cases where you can't be sure, because there are some common conversions, e.g. 24fps film converted to 25fps and 29.97fps for broadcast and DVDs. And be careful of assumptions about interlaced/progressive/telecined. Note also that DVDs can mix these.


Movies still mostly use 24fps, primarily because we're used to it. It looks calmer, and we associate 24fps with higher quality, partly because historically higher framerates remind us of home video and its associated cheapness (and of technical uses like interlaced sports broadcasts).

(Of course, these associations are also entangled with camerawork and other aspects, so it's not quite that simple. It's partly silly, because TV broadcast of 24fps material necessarily involved some less-than-perfect conversions)


On 1001 and approximations

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

NTSC made things interesting.

NTSC has existed since 1941; it was then exactly 30fps, had no color, and fit in 4.5MHz of usable radio frequency.

In 1953 it was replaced by what we now call NTSC color. Because people wanted it to be backwards compatible with the black and white televisions of the time, allowing one transmission to serve both, it was designed to send separate luminance (black and white) roughly as before, and separate chrominance information.

...and in the same frequency band, so chrominance would have to overlap the luminance. Minimizing the interference between the two, with the method they ended up choosing, meant some math that required that the amount of columns, amount of rows, or amount of frames per second had to be shifted a little.

They chose to keep the line count at 525, settle the relation to the 4.5MHz carrier at a factor of 286, and fudge the frames per second, meaning that that number is

4500000Hz / (525*286)

which happens to simplify to

30000/1001

Which is approximately

29.970029970

So, NTSC color broadcast is never 30fps, it's always 30000/1001.
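
The same arithmetic as exact fractions rather than floating point:

from fractions import Fraction

ntsc = Fraction(4_500_000, 525 * 286)
print(ntsc)                            # 30000/1001
print(float(ntsc))                     # 29.97002997002997
print(ntsc == Fraction(30000, 1001))   # True
print(float(Fraction(2997, 100)))      # 29.97, the common (slightly off) approximation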


The approximation 29.97 is inaccurate, though by so little that getting it wrong only becomes visually apparent after an hour or two.

Compared to what, though? Does it matter at all?

It can matter when processing content where NTSC is involved somehow, such as going between PAL and NTSC hardware/broadcast, so matters to professional transcoding. For typical-youtube-length videos you probably wouldn't even notice if this conversion was done incorrectly.


(Similarly, 23.976 for 24000/1001 (which happens in film-to-NTSC conversion) is also slightly off, and 23.97 and 23.98 more so.)(verify)

(Since the same trick wasn't necessary for PAL, PAL is always 25fps precisely.)

(Also, film to PAL is often just played that ~4% faster.)


In most cases, the difference between the fraction and its approximation is tiny, but will be noticeable over time, in timecode, and probably audio sync, depending a little on how the audio is synced.


You can fix NTSC's timecode issue with Drop-Frame (DF) timecode.

Timecode counts in whole frames, so from its perspective the difference between 29.97fps instead of 30fps is ~0.02997 missing frames per second, which is roughly (30*60*60 - 29.97*60*60=) 108 frames per hour.

Drop Frame Timecode (DF) skips two frame numbers from just its own count (no actual frames are dropped), in nine out of every ten minutes. This happens to work out as exactly (2 frames * (6*9) applicable minutes =) 108 per hour. (This technically still leaves a rounding error, but it happens to be precise enough for most anything we do.)
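
A sketch of converting a frame count into drop-frame timecode, using the standard counting trick (nominal 30fps, with two frame numbers skipped at the start of every minute except every tenth minute):

def frames_to_df_timecode(frame_number):
    """Frame count at 30000/1001 fps -> drop-frame timecode 'HH:MM:SS;FF'. Sketch."""
    frames_per_10_minutes = 30 * 60 * 10 - 18     # 17982: 2 numbers dropped in 9 of every 10 minutes
    frames_per_minute = 30 * 60 - 2               # 1798: a minute in which numbers are dropped

    d, m = divmod(frame_number, frames_per_10_minutes)
    if m < 2:
        dropped = 18 * d
    else:
        dropped = 18 * d + 2 * ((m - 2) // frames_per_minute)

    f = frame_number + dropped                    # renumber as if counting a full 30 frames per second
    frames = f % 30
    seconds = (f // 30) % 60
    minutes = (f // (30 * 60)) % 60
    hours = f // (30 * 60 * 60)
    return f'{hours:02d}:{minutes:02d}:{seconds:02d};{frames:02d}'

# One hour of 30000/1001 video is 107892 frames; drop-frame timecode makes that read as one hour:
print(frames_to_df_timecode(107892))   # 01:00:00;00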


Non-Drop-Frame (NDF) timecode simply counts every frame without skipping numbers, which is fine for content without this 1001 NTSC bother in the first place.



Common constraints

When video is made for (analog) broadcast, it is very much constrained by that standard's color and frame coding, and more significantly in this context, its framerate.


When video is made for film, it is limited by projector ability, which has historically mostly been 24fps.

When it is made for NTSC or PAL broadcast, it is usually 30000/1001 or 25, respectively.


Computers will usually play anything, since they are not tied to a specific rate. Even though monitors have a refresh rate, almost all common video will be slower than that, meaning you'll see all frames anyway. (Computer playback is often synchronized to the audio rate.)

Common conversions

Conversion from 24fps film to 25fps PAL broadcast rate can be done by playing the frames and audio faster by a factor 25/24, either letting the pitch be higher (as will happen in analog systems) or the same pitch by using digital filters (optional in digital conversion).

Few people will notice the ~4% difference in speed, pitch or video length.

This does not work for NTSC as the difference is ~25%. Instead, telecining is often used, though it also makes film on TV a little jumpier for the 60Hz/30fps countries (including the US).


Interlacing, telecining and such

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Progressive

Progressive means that each frame is drawn fully on screen, and that frames are drawn in simple sequence.


Seems very obvious. Best understood in contrast with interlacing and telecining (pixel packing can also matter somewhat).


Constraints in aether bandwidth and framerates are the main reasons for interlacing and telecining. Computers are not as constrained in those two aspects as broadcasting standards and CRT displays are, and as such, other factors (such as the compression codecs) tend to control the nature of video that is digitally available. This is one reason that much of it is progressive.


Interlacing

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Interlacing comes from a time where CRT screens were new, TV broadcast was already a thing, and engineers worked to squeeze more video out of fixed allocation of bandwidth.


Interlacing refers to a method where one update covers every second physical scanline, and the next update covers only the lines that were left out.

For example, given 25 full-screen rasters per second, you could do 50 refreshes per second each updating half the lines.


It considers that the phosphors that we're lighting up are relatively slow to fade out, and that the human visual system is less sensitive to flickering details than to flickering areas. This happens to be one of the least-noticeable ways of updating the screen at what looks like twice the speed and - not unimportantly - TVs that didn't know about interlacing at all would still do an entirely sensible thing too. So it was genuinely clever, given the constraints of the time.


The broadcaster had the option of doing either.

Conceptually, they could now choose to:

  • send 50fps of images that are each half vertical resolution
    • which looks faster, so great for things like sports broadcasts
  • send a 25fps source, essentially building up each full frame from two halves
    • gives more details
    • and would e.g. make sense when the source is 24fps film


Note that at a lower level, you could consider

  • both to be 25 full-screen rasters per second,
  • both to be 50 updates per second,

and the only real difference to what you end up seeing is what they contain image-wise.


Why do it, and why not?

The main reason to do interlacing is to add speed, or the option of speed, within a fixed bandwidth whose frame transfer was also already settled.

Which was absolutely true for analog TV - not so much since.


There is even something to be said for interlacing when you don't have such constraints. It's a simple compression-like scheme, that works well enough for video content that shows mostly large-ish details and is predictable enough. You can have deinterlacing make specific assumptions to estimate the original - an estimate that will be better than just doubling the lines, yet smaller than the original data. Advanced de-interlacing algorithms, supported by newer and faster hardware (and making assumptions that are true for most video), can bring quality back to levels that are surprisingly near the original, for most but not all content.


One good reason against interlacing is that it is inherently lossy.

Another is that when digital video compression is involved, you should probably leave such details up to the compressor. In fact, most deal poorly with interlaced data (unless they are expecting it).


While with SDTV broadcast, interlacing was just the way it worked for archaic reasons, with HDTV it's optional - and a specific choice.

HDTV does see interlaced programs, and some HD camcorders still choose to do it, mostly for the same reasons as before - saving bandwidth (but now in terms of network use and storage, instead of RF).

Yet it's also less significant, because while it does save space/bandwidth, it's not half, and there's now always compression anyway.

Also, when your job involves editing and/or compressing video, interlacing means extra steps, extra processing, and extra chance of specific kinds of artefacts.


Some more gritty

Note that depending on how (and where in the processing chain) you look at the content, you could refer to interlaced content's rate either by the transferred rasters or the shown fields - the transferred framerate or the shown framerate (30000/1001 or 60000/1001 for NTSC, 25 or 50 for PAL). This leads to some confusion.

From this perspective of displaying things, each image (raster) that is received/read contains two frames, shot at slightly different times, and updated on the display at different times.


This is not the only relevant view. Things become a little funnier around digital video. Consider that NTSC is 30 rasters per second regardless of what it contains, and it's simpler for digital video to just store those rasters.

Digital video will often not see interlaced video as having a doubled framerate. If you just show that as-is, there is no longer a difference between rasters and frames, and you will show two (chronologically adjacent) half-updates at the same time.

Which looks worse than it did on a TV: while you're showing the same content, the interlacing lines are more apparent on a digital display than on a TV's phosphors.

For this reason, de-interlacing is often a good idea, which actually refers to a few different processes that make the video look better.


Interlacing in general and in analog systems general happens at the cost of display artifacts under particular conditions: while interlacing is usually not particularly visible on TVs, specific types of resulting problems are brought out by things such as fast pans (particularly horizontal movement), sharp contrast and fine detail (such as small text with serifs, computer output to TV in general), shirts with small stripes (which can be lessened with some pre-broadcast processing).


Interlacing is also one of a few reasons that a TV recording of a movie will look a little less detailed than the same thing on a (progressive) DVD, even when/though DVDs use the same resolution as TVs (other reasons for the difference are that TV is analog, and involves various lossy steps coming to your TV from, eventually, the original film material).


For notes on telecined content, see below. For now, note also that interlacing is a technique applied at the playing end (involves taking an image and playing it as two frames), while telecining just generates extra frames initially, which play progressively (or have inverse telecine applied to it; see details).



Note that when a storage format and player are aware of interlacing, it can be used smartly again. For example, DVDs may mix progressive, telecined, and interlaced behaviour. The content is marked, and the player will display it accordingly. Interlaced DVD content is stored in the one-image-is-two-frames way, and the player will display it on the TV in the double-framerate-half-updates way described above.


Deinterlacing
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Deinterlacing takes interlaced material and produces a progressive-scan result.

Often applied to make interlacing's visual artifacts, particularly jagged edges (a.k.a. sawtooth edge distortion, mice teeth, combing, serrations), less noticeable, whether for display or for (re)compression (as lossy video compression that isn't counting on interlaced video deals with it quite badly).

Note that in some cases deinterlacing reconstructs the original. In most cases it is an inherently lossy process, in that it throws away data and isn't exactly reversible - but may be worth it perceptually.


There are also smarter and dumber ways of doing deinterlacing (most of those detailed below are the simple and relatively dumb variants), and the best choice depends on

  • whether you are doing it for display or for storage/(re)compression
  • whether you are displaying on a CRT (phosphor and scanline) or something else with output restrictions, or on a digital display (which tends to have few restrictions)
  • the nature of the video before interlacing (were the two fields captured at different times or not?)

You may like to know that:

  • Analog TV has to adhere to broadcast standards from the 1930s and is interlaced
    • ...but whether the two fields in a frame are from different times varies
    • in PAL countries most films are not shown interlaced - the two fields come from the same film frame (the 25/24 discrepancy is fixed by speeding up the film that much)
    • in NTSC countries, film is likely to be telecined (similar to interlacing)
    • sports and such is taken at 50/60fps and sent as that many fields (so shown as that many interlaced half-frames), as it looks smoother.
  • different types of camcorders may store half-height interlaced video, or may store progressive frames (depends on goal and quality).
  • TV capture cards usually take the TV signal as-is, so tend to return still-interlaced video (verify)


For the examples below, consider two video sources transmitted through interlaced broadcast:

  • video A: a film at 25 frames per second. The two fields complement each other for the exact original frame
  • video B: sports footage, or video from an interlacing camcorder. Each of the 25 frames per second has its two fields taken 1/50th of a second apart.

The text below will mix 'field per second' and 'frame per second' (and sometimes fps for progressive, where there is no difference), so pay attention and correct me if I get it wrong :)


Weave

Weaving creates progressive frames by showing both fields at the same time, still interlaced.

Yes, this is a fancy name for 'doing nothing', other than copying each line from the alternate frame.

It's simple to do, and it's fast. Technically it retains all the video data, but in a way that doesn't look good in various practical cases.


Weaving video A means we reconstruct the video at its original resolution and frame rate, and can show it as such digitally.

Weaving video B means we construct a 25fps video with jagged lines (wherever there is fast movement). On a digital display this tends to be more noticeable than on a CRT, so for some purposes you are better off with any deinterlacing method that aims for 50fps output. It matters in particular for video compression: most codecs don't deal well with such one-pixel details, so weaved content typically gives lower-quality encodes (for the same output size).


Digital capture from interlaced media (e.g. a TV capture card) will often capture in a way that effectively weaves frames, which is why you get the jagged-movement effect if the video is never deinterlaced.
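As a concrete illustration (assuming a reasonably recent ffmpeg; filenames are placeholders), splitting frames into their fields and weaving them back together is roughly a no-op, which is exactly the point:

ffmpeg -i input.avi -vf separatefields,weave -c:a copy woven.avi

The weave filter just interleaves two successive field images into the lines of one full-height frame.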

Discard

When displaying video on a device with too little processing power, an easier-to-code and faster method is discarding every second field (a.k.a. single field mode) and drawing each line of the remaining field twice (line doubling), to get output that has the same size and frame rate as the original.

On video A we throw away half the vertical detail (and produce 25fps video).

On video B we throw away half the vertical detail and half the time material as well (and produce 25fps video).

Compare with bob. (basically, discard is half a bob)
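A rough equivalent with ffmpeg (a sketch, assuming a reasonably recent build; filenames are placeholders): keep only one field per frame and line-double it back to full height:

ffmpeg -i input.avi -vf "field=top,scale=iw:ih*2:flags=neighbor,setsar=1" -c:a copy discarded.avi

(flags=neighbor makes the scaler repeat lines rather than interpolate, which is the literal line doubling described above.)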


Blending

You can use all the data by blending (a.k.a. averaging, field combining) both fields from a frame into a single output frame.

For video A, you would produce an (unnecessarily) blurred version of the original. Weave may be preferable in various cases.

For video B you would create a 25 frame per second version with less jagging, but motion will have a sort of ghosting to it. Better than weave in that the jagging isn't so obvious, but moving content will look somewhat blurred.


(Note that sizing down video has a similar effect, and can even be used as a quick and dirty way of deinterlacing if software does not offer any deinterlacing options at all)
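One way to get a blend-style deinterlace with ffmpeg is the libpostproc linear-blend subfilter - assuming your build includes the pp filter; filenames are placeholders:

ffmpeg -i input.avi -vf pp=lb -c:a copy blended.avi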


Bob

Bob / bobbing (also called 'progressive scan', though that is an ambiguous term) refers to taking both fields from each frame and displaying them in sequence, each line-doubled to full height.

Video A would become line-doubled 50 frames per second. Stationary objects will seem to bob up and down a little, hence the name. You're also doubling the amount of storage/bandwidth necessary for this 25fps video while reducing its vertical detail.

Video B would be shown at its natural, fluid 50 frames per second (note that most other methods would create 25 frame per second output). (Note that if you take the wrong field first, you'll put them in the wrong order and the video will look a bit twitchy.)
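A naive bob with ffmpeg (a sketch, assuming a reasonably recent build; filenames are placeholders): split each frame into its two fields and line-double each, giving full-size frames at double the frame rate:

ffmpeg -i input.avi -vf "separatefields,scale=iw:ih*2:flags=neighbor,setsar=1" bobbed.avi

Smarter field-rate deinterlacers (see the next section) interpolate the missing lines rather than just repeating them.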


Bob-and-weave and optional cleverness

Bob-and-weave refers to combining the two: weave the parts of the image that are static (keeping full vertical detail) and bob the parts that move (avoiding combing), typically decided per region. Motion-adaptive deinterlacers work roughly along these lines.
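Motion-adaptive deinterlacers such as ffmpeg's yadif and bwdif work roughly in this spirit. A sketch (assuming a reasonably recent ffmpeg; filenames are placeholders) that outputs one frame per field, i.e. 50/60fps:

ffmpeg -i input.avi -vf yadif=mode=send_field deinterlaced.avi

With the default mode=send_frame it instead keeps the original frame rate.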


Telecine

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Telecine (portmanteau of 'television' and 'cinema') refers to the process of converting video between these two worlds -- in a wide sense, so can refer to various distinct methods.


For a long while, the most common were:

  • conversion from film to PAL television (25fps)
    • often by playing the 24fps frames at 25fps and speeding the audio up to match
  • conversion from film to NTSC television (30fps, adding intermediate frames)
    • specifically three-two pulldown - in some contexts telecine is used as a near-synonym for 3:2 pulldown (which is a little misleading, and this page is also guilty of that)


Frame rate conversion from 24fps film to 30000/1001 broadcast NTSC is usually done using three-two pulldown, which uses a constant pattern of some frames shown as-is and some interlace-like, half-updating frames. This variant of a more general technique turns each group of 4 film frames into 5 video frames, which for 24fps input means an extra 6 frames per second.

Since you end up with 30 frames per second that look approximately the same as the 24fps original, the audio speed can stay the same. It still means leaving some frame content on screen longer than other content, which is visible in some types of scenes. For example, a slow smooth camera pan would show a slight judder.
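For what it's worth, ffmpeg has a telecine filter that applies this kind of pulldown; a sketch (assuming a reasonably recent build; filenames are placeholders):

ffmpeg -i film24.mp4 -vf telecine=pattern=23 telecined.mp4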

Three-two pulldown is invertible, in that you can calculate the original frames from a stream of telecined frames, though if content is spliced after three-two pulldown is applied, you may lose the ability to reconstruct a frame or two.

Video editing would probably choose to decode back to the original frames.

Potentially, so might playback on progressive monitors, because showing the original frames gives a little less judder.

Inverse telecine is also useful when (re)compressing telecined video, since many codecs don't like the interlace-like effects of telecining and may deal with it badly.
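With ffmpeg, the usual inverse-telecine chain is field matching followed by dropping the duplicate frames, with a deinterlacer catching whatever could not be matched (e.g. around splices). A sketch, with placeholder filenames:

ffmpeg -i telecined.mpg -vf "fieldmatch,yadif=deint=interlaced,decimate" restored.mp4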


Hard telecining refers to storing telecined content, e.g. 30000/1001 video generated from 24fps film content, so that the given content can be played (progressively) on NTSC equipment. The upside is that the frame rate is correct and the player doesn't have to do anything fancy; the downside is that it usually has a negative effect on quality.

Soft telecining refers to storing video at the original frame rate (e.g. 24000/1001 or 24fps film) and flagging the content so that the player (e.g. set-top DVD players) will telecine it to 30000/1001 on the fly. NTSC DVDs with content originating from 24fps film usually store 24000/1001fps(verify) progressive, with pulldown flags set so that the DVD player will play it as 30000/1001fps.


See also:

Mixed content

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Most container formats aren't picky about the exact contents and will allow mixes of different frame rates, of progressive and telecined, of progressive and interlaced, and sometimes even all three.


When you want to edit video content further, you probably want to figure out whether a video is progressive, interlaced or telecined.

One way to do this is using a player that allows per-frame advancing (such as mplayer). Make sure it's not applying filters to fix interlacing/telecining, find a scene with movement (preferably horizontal movement/panning), and see whether there are interlace-style jaggies.

  • If there are none, it is progressive (or possibly already deinterlaced by the player)
  • If there are in every frame, it is interlaced
  • If there are in only some frames, it is telecined (two out of five in 2:3, 24-to-30fps telecine).

Note that things like credits may be different (apparently often telecined on DVDs).
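One semi-automated way to get a hint is ffmpeg's idet filter, which classifies a sample of frames as interlaced (TFF/BFF) or progressive and prints the counts - a sketch with placeholder filenames; the numbers are heuristics, not proof:

ffmpeg -i input.vob -vf idet -frames:v 500 -an -f null -

Mostly-progressive counts suggest progressive content, mostly-interlaced suggests interlaced, and a consistent mix (roughly two in five frames detected as interlaced) hints at telecined content.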


While telecining uses a regular pattern of extra frames, splicing after telecining means the video will usually not follow that pattern around the splice, so inverse telecine may not be able to decode all original frames there. This is often why encoders/players complain about a few skipped and/or duplicate frames over a movie's worth of frames, and you can usually ignore this - hardware players deal with the same thing.

See also (interlacing, telecining)

Deinterlacing:

Telecining:

Deinterlacing, telecining:

Various:

(Analog) TV formats

There are a number of variants on NTSC, PAL and SECAM that may make TVs from different countries incompatible. NTSC is used in North America and part of South America (mostly NTSC M), and Japan (NTSC J).

PAL is used in most of Europe, part of South America, part of Africa, and Asia. SECAM is used in a few European countries, part of Africa, and Russia.

PAL M (used in Brazil) is an odd one out, being incompatible with other PAL standards, and instead resembling NTSC M - in fact being compatible in the monochrome part of the NTSC M signal.


NTSC and PAL are largely the same inside (PAL fixed a color consistency problem that NTSC was stuck with because its standard was already set), and differ mainly in frame rate and number of lines.

CRT TVs often support just one of these, as it would be complex (and imperfect) to convert more than one, and few people would care for this feature as most people had one type of broadcast around.


It should be noted that the actual broadcast signal uses more lines than are shown on the screen. Only a subset of the video lines form the raster, the imagery that will be shown on screen.

  • 525-scanline video (mostly NTSC) has 486 in the raster, and many show/capture only 480(verify)
  • 625-scanline video (mostly PAL) has 576 in the raster

The non-raster lines historically were the CRT's vertical blanking interval (VBI), but now often contain things like teletext, closed captioning, station identifiers, timecodes, and sometimes even content ratings and copy protection information (note: not the same as the broadcast flag in digital television).

Video recording/capture will often strip the VBI, so it is unlikely that you will even have to deal with it. Some devices, like the TiVo, will use the information (e.g. respect copy protection) but do not record it (as lines of video, anyway).

Devices exist to add and alter the information here.


PAL ↔ NTSC conversion

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Note that various DVD players do this, although others do not, and neither the fact that they do nor that they don't is necessarily advertised very clearly.


PAL to NTSC conversion consists of:

  • Reducing 625 lines to 525 lines
  • creating ~5 more frames per second

NTSC to PAL conversion consists of:

  • increasing 525 to 625 lines
  • removing ~5 frames per second


The simplest method, which cheaper on-the-fly conversion devices often use, is to duplicate/drop lines and frames. This tends to not look great (choppy motion, and cropped or stretched images).


Linear interpolation (of frames and lines) can offer smoother-looking motion and fewer artifacts, but is more computationally expensive and has further requirements (such as working on deinterlaced content).

Fancier methods can use things like motion estimation (similar to fancy methods of deinterlacing)
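As a very rough software sketch of the simple approach (deinterlace, rescale the line count, duplicate/drop frames to hit the target rate) - assuming a reasonably recent ffmpeg, with placeholder filenames, and ignoring all the analog and color-system details:

ffmpeg -i pal_input.mpg -vf "yadif,scale=720:480,fps=30000/1001" ntsc_like.mp4

Swapping the fps filter for something like minterpolate=fps=30000/1001 is the motion-estimation route, at a much higher CPU cost.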

Digital / HD broadcasting

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

ATSC in the US, DVB in Europe


See also:


See also (frames and formats)


Semi-sorted

On types and groups of frames

In MPEG-like codecs (DivX/Xvid, H.264, and more), there is a choice to encode each frame as a/an...

  • I-frame (intra-frame)
    • a frame you can decode purely from its own data
    • larger than predictive (P- and B-)frames when those can predict differences to adjacent frames fairly well (e.g. motion and such), which is usually the case. When predictive frames don't work well, such as at camera cuts, I-frames are preferable.
    • having a guaranteed I-frame every so often helps faster seeking, because seeking often means "look for the most recent I-frame, and start decoding all video until you reach the requested frame"
  • P-frame (predictive)
    • uses information from the previous frame(s)
    • ...which is typically less information than a complete I-frame, so makes for better compression
  • B-frame (bidirectional predictive)
    • uses information from both previous and following frames
    • which ends up being less information than forward-only prediction
    • does better on slow motion(verify)
    • more complex to encode
    • more complex to decode

There also used to be D-frames, in MPEG-1, which were basically low-quality easy-to-decode I-frames, which allowed fast preview while seeking.


A GOP (group of pictures) is a smallish group of adjacent frames that belong together in some sense.

A GOP starts with an I-frame, and may contain P-frames, B-frames, and technically can also contain I-frames (a GOP is not defined by position of I-frames, but by being marked as a new GOP in the stream. That said, multiple-I-Frame GOPs are not common)(verify).

Most encoders use I-frames (so start a new GOP)...

  • when it makes sense for the content, such as on scene cuts where a predictive frame would be lower quality
  • once every so often, to guarantee seeking is always fastish by always having a nearby I-frame to restart decoding at
  • to satisfy other frame-type restrictions. For example, encoders are typically limited to at most 16 B-frames in a row, and at most 16 P-frames in a row (which gets more interesting when you mix I-, P-, and B-frames)


To make things more interesting, the stored order, the decode order, and the display order can all be different. This is largely because a B-frame by its nature cannot be decoded until the frames it references have been decoded, so B-frames are usually stored/transmitted after the I- or P-frames they depend on [77]


If your video consists entirely of I-frames, you could call that GOP-less, or size-1-GOP. The file is much larger for the same video, but all seeking is fast, which is nice when editing video, or when you want fast frame-by-frame inspection in both directions, for example when studying animation.
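Editing-oriented codecs (MJPEG, ProRes, DNxHD and such) are intra-only by design; with a general-purpose codec you can force the same by setting the GOP size to 1. A sketch with ffmpeg/x264 (filenames are placeholders):

ffmpeg -i input.mp4 -c:v libx264 -g 1 -c:a copy allintra.mp4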


A closed GOP is one that can be decoded completely without need of another GOP - basically, ends in a P-frame.

This is contrasted with an open GOP, which ends in a B-frame and so needs to look into the next GOP's first frame (an I-frame).

Open GOPs make for slightly more efficient coding, because you're using a little bit more predictive coding, and in most cases you're just playing all of it anyway.

Closed GOPs make for slightly easier decoding, and for slightly faster seeking to some frames.


Standardized formats, particularly those for hardware players, have guidelines that usually restrict GOPs to relatively small sizes, on the order of 5 to 30 frames.

For example, the specs for DVDs apparently say 12 frames max, which is about half a second worth of video, in part to guarantee seeking in reasonable steps. Blu-ray seems to say 24 frames and/or 1 second (though it depends a little(verify))


Allowing large GOPs (say, 300 frames, ~10 seconds) makes for slightly more efficient coding in the same amount of bytes (because you can make a few more frames predictive rather than forcing I-frames), but it's a diminishing-returns thing.


Keep in mind that various video players seek faster by going to the I-frame nearest the requested frame rather than to the requested frame itself. This is why they won't always jump by the same amount, may not do fine seeking at all, and with very large GOPs may act a little awkwardly.
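When encoding, the maximum GOP size is usually just the keyframe-interval setting. For example, with ffmpeg/x264 (values and filenames are placeholders), capping GOPs at 48 frames means seeking never has to decode more than about two seconds of 24fps video to reach a frame:

ffmpeg -i input.mp4 -c:v libx264 -g 48 -c:a copy output.mp4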




To inspect what frame types a file has:

  • People mention ffprobe -show_frames - this still works in current versions (it is verbose, but can be narrowed down with -show_entries; see the example below).
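For example (assuming a reasonably recent ffprobe; the filename is a placeholder), listing just the frame type of each video frame:

ffprobe -v error -select_streams v:0 -show_frames -show_entries frame=pict_type -of csv=p=0 input.avi

Piping that through sort | uniq -c gives a quick count per frame type.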


  • libavcodec has a vstats option, which writes a file in the current directory with frame statistics about the input file. For example:
mplayer -vo null -nosound -speed 100.0 -lavdopts vstats input.avi

(Without -speed it seems to process at playback rate; a large -speed value is a crude way to make it run as fast as it can, and there is probably a neater way.)

  • or ffmpeg:
ffmpeg -i input.avi -vf showinfo -f null -



Some notes on aspect ratio

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Display Aspect Ratio (DAR) means "this ought to be shown at this ratio". Example: 16:9. This is information that some files can store to tell the player to do so (or that standards can imply for all files of a type).

DAR allows more arbitrary aspect ratios in the actual pixel dimensions - for example, SVCDs are 480x480 (NTSC) or 480x576 (PAL/SECAM) but store content meant for 4:3 or 16:9 display, which is one way to store more vertical than horizontal detail. SVCD players will rescale this to some resolution that has the proper aspect ratio at play time (usually just fitting it in the largest non-cropping size for the given TV resolution(verify)).


This works because MPEG can store aspect ratio information, so hardware players and most software players listen to it and use it. Not all (software) players understand it in MPEG4 files yet, though.

AVI (and any content stored in it, including MPEG4) does not support it -- but the OpenDML extension that does allow it is now fairly commonly used. Not all players know about OpenDML, though most that matter do.

When encoding, the best choice depends on what you want to play things on. The most compatible way is rescaling so that square pixels would play correctly. However, this usually means smallish resolution changes, which can look like a mild blur.
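For MPEG-family containers you can often just (re)set the display aspect ratio flag without re-encoding; whether it sticks depends on the container and on players honoring it. A sketch with ffmpeg (filenames are placeholders):

ffmpeg -i input.mpg -aspect 16:9 -c copy flagged.mpg

When re-encoding, the setdar/setsar filters do the equivalent inside the filter chain.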

Some notes on frame rates

See also #Frame_rate.2C_analog_TV_format.2C_and_related for basic notes on frame rate.



Resolution names/references

This stuff is vaguer and/or more confusing than it should be.


Consider 2K.

  • Around cinema it means DCI 2048x1080 (and some cropped variants, such as 1998x1080 and 2048x858)
  • Around PCs it means 2560x1440. Except when it doesn't.


And then consider lists like WSXGA+, WQXGA, WQXGA+, WQSXGA, WUQSXGA, CWSXGA, WQHD. Without looking it up, do you know any of their resolutions? And which one of them doesn't exist?

Without looking it up, what's the difference between QuadHD and Quad FullHD?

If you can answer either of those questions, you're a hardware freak, the proverbial 1%.


And then there's the fact that a single-number name can mean different things depending on context and industry:

  • Cinema resolutions are defined by DCI
    • these are named by horizontal resolution, fairly precisely
  • On PCs, people seem to have settled on an "agree to disagree" vagueness around some references
    • Take 2K: it's a horizontal resolution, yet can be and is explained as both "where the width is approximately 2000" or "...starts with a 2, i.e. falls into the 2000...3000 pixel range" (except that that is not the convention for 4K, 5K, 8K, 10K, or 16K)
    • It can mean 2560x1440, which it often seems to do when you say "2K monitor"
    • Or 2048x1080 (some diagrams specifically call this 2K, and call 2560x1440 WQHD and not 2K)
    • Or 1920x1080 (a.k.a. 'Full HD') because it's close enough. (Or not, with a "it was an existing thing when we wanted 2K to mean a different thing" argument)
    • Or maybe sometimes 2048x1536
    • and technically it includes Cinema 2K
  • TV-style resolutions are typically referred to by their vertical (shortest) edge
    • in part because historically the interlaced/progressive difference was important, and the larger horizontal resolution could be implied from the vertical, largely because there were only a few in use
  • I've worked in a field where sensors are usually square, but if not we'd often use the shortest edge because it was the limiting one


TV

References like 480i and 720p became more commonly used in the HD era (i.e. now), partly just because they are brief and more precise.

(These references are not often seen alongside monitor resolutions, perhaps because "720p" and "1080p HD" are newer names that are easier to market when you don't realize that a good many CRT monitors had been doing such resolutions since the nineties.)


References such as 480i and 720p refer to the vertical pixel size and whether the video is interlaced or progressive.

The common vertical resolutions:

  • 480 (for NTSC compatibility)
    • 480i or 480p
  • 576 (for PAL compatibility)
    • 576i or 576p
  • 720
    • always 720p; 720i does not exist as a standard
    • 1280x720 (sometimes 960x720)
  • 1080 (HD)
    • 1080i or 1080p
    • usually 1920x1080


There are some other newish resolutions, many related to content for laptops/LCDs, monitor/TV hybrids, widescreen variations, and such.


HD TV broadcasts for a long while were typically either 1080i or 720p. While 1080i has greater resolution (1920x1080 versus 1280x720), 720p does not have interlace artifacts and may look smoother.


The 480 and 576 variants usually refer to content from/for (analog) TVs, so often refer to more specific formats used in broadcast.

  • 576 often refers to PAL, more specifically:
    • analogue broadcast TV, PAL - specifically 576i, and then often specifically 576i50
    • EDTV PAL is progressive, 576p
  • 480 often refers to NTSC, more specifically:
    • analogue broadcast TV, NTSC - specifically 480i, and then often specifically 480i60
    • EDTV NTSC is progressive, 480p
  • 486 active lines seems to refer to older NTSC - it now usually has 480 active lines

There is more variation with various extensions - widescreen, extra resolution as in e.g. PALPlus, and such.



Sometimes the frame rate is also added, such as 720p50 - which usually refers to the display frequency applicable.

In cases like 480i60 and 576i50 you know this probably refers to content from/for NTSC and PAL TV broadcast, respectively (...though there are countries with other res-and-frequency combinations).

See also:


Analog horizontal pixel resolution is approximate

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

For analog TV, pixel-per-line resolution is not really set in stone. Because of the way the signal is used, anything above ~500 or so looks good enough.

  • Cheaper NTSC CRTs couldn't really display much more than 640 pixels per line; similarly for cheaper PAL CRTs (verify)
  • 720 was treated as a maximum (fancy editing systems of the time supported it)
  • 704 is a sort of de facto assumption of the average that TVs tend to display(verify), and is also what EDTV uses (704x576 for PAL and 704x480 for NTSC)

As such,

  • NTSC can be 720x480 or 704x480, or 640x480,
  • PAL can be 720x576 or 704x576,

depending a little on context.


On digital broadcast, a stream has a well-defined pixel resolution, but since the displays are more capable, they are typically quite flexible in terms of resolution and frame rate.


Relevant acronyms here include

  • ATSC (digital broadcasting in the US, replacing analog NTSC)
  • DVB (digital broadcasting in Europe, replacing analog PAL)
  • EDTV (sort of halfway between basic digital broadcast and HDTV)
  • HDTV


More resolutions

See e.g. the image on http://en.wikipedia.org/wiki/Display_resolution


Screen and pixel ratios

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)



Video capture hardware

Video editing hardware

Webcam/frame-oriented software

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

(What you may want to look for in) more-than-webcam software

Mainly for editing

Mainly for conversion

Some specific tools

See also