Video file format notes



📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.


This page is mostly about storage of video, and variation therein. It also touches on some video capture.

For notes on encoding video, see Video encoding notes.


Digital video (files, streaming)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

This is meant primarily as a technical overview of the codecs in common and/or current use (with some historical relations where they are interesting, or just easy to find), without too many details; there are just too many old and specialist codecs and details that are not interesting to most readers.


Note that some players hand off reading/parsing file formats to libraries, while others do it themselves.

For example, VLC does a lot of work itself, particularly using its own decoders. This puts it in control, allowing it to be more robust to somewhat broken files, and more CPU-efficient in some cases. At the same time, it may not play unusual files, since it won't perfectly imitate other common implementations, and it won't be quite as quick to pick up codecs it doesn't know about; in those cases, players that hand off the work to other components (such as mplayerc) will work better.


Container formats

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Containers are file types that can usually hold multiple streams, of various types, using various codecs.

Some relatively general-purpose container formats include:


AVI (Audio Video Interleave)

AVI, short for Audio Video Interleave (a RIFF derivative; see also IFF), was a container format for varied video and audio, and in itself said little about what it contained.

It was quite common for a long time, but has now been largely displaced because it lacks features. It also didn't help that many AVIs in the wild technically violate the AVI standard (though they play fine on many computer players), because the format does not really allow things like embedded subtitles, or multiple audio tracks, without some hackery. That hackery may be conventional, but it is still not standard, which makes it hard to guarantee AVIs will play everywhere.


Derived:

  • Files with the .divx extension are usually AVIs (...containing DivX video)
  • Google Video (.gvi) files use MPEG-4 ASP and MP3 in a mild variant on AVI container [1] (and do not really exist anymore)

MKV (Matroska Video)

An open standard, and compared to e.g. AVI a well-designed many-stream format. Because it allows subtitle embedding, you avoid the hassle related to external subtitle files.

Ogg

Ogg is a container format - see also Ogg notes - and an open standard.


Extension is usually .ogg, sometimes .ogm, though .ogv, .oga, and .ogx are also seen.

Note that initially, ogg often implied Ogg Vorbis: Ogg containers containing Vorbis audio data.



Ogg Media (.ogm) is an extension of Ogg which supports subtitle tracks, audio tracks, and some other things that make it more practical than AVI and put it alongside things like Matroska.

Ogg Media is not really necessary and will probably not be developed further, in favour of letting Matroska become the wider, more useful container format.(verify)

Proprietary/minor/other

A number of container formats support only a limited number of codecs (sometimes just one), particularly if they are proprietary and/or specific-purpose.

Such container formats include:

  • Flash video (.flv) [2]
  • NUT (.nut), a competitor to avi/ogg/matroska [3]
  • QuickTime files (.mov) are containers, though without extensions to QuickTime they support relatively few codecs. Recent versions added MPEG-4 support.
  • ASF (Advanced Systems Format), a proprietary format from Microsoft, most commonly storing WMA and WMV content, and seeing little other use in practice (partly because of patents and active legal protection). [4]
  • RealMedia (.rm)
  • DivX Media Format (.dmf)


Fairly specific-purpose:

  • Digital Picture Exchange (.dpx) [6]
  • Material Exchange Format (.mxf) [7]
  • Smacker (.smk), used in some video games [8]
  • Bink (.bik), used in some video games [9]
  • ratDVD

DVD-Video

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

tl;dr: data in MPEG-2 PS, some restrictions, and some DVD-specific metadata/layout around it.


A VIDEO_TS directory with VOB, IFO, and BUP files is, in a fashion, a container format, as it is the DVD-Video way of laying out:

  • metadata about stream data (chapters, languages of tracks, angles, etc.)
  • Video streams (usually MPEG-2, sometimes MPEG-1)
  • Audio streams (AC-3, MPEG-1 Layer II (MP2), PCM, or DTS)
  • Subtitle streams (bitmap images)

(note: The AUDIO_TS directory is used by DVD-Audio discs, which are fairly rare. On DVD-Video discs, this directory is empty, and the audio you hear is one of the streams in the VOBs.)


IFO stores metadata for the streams inside the VOB files (e.g. chapters; subtitles and audio tracks). BUP files are simply an exact backup copy of the IFO files (to have a fallback for a scratched DVD).


VOB files are containers based on MPEG-2 PS, and store the audio, video, and image tracks.

VOB files are segmented into files no larger than 1GB, a design decision meant to avoid problems with filesystems' file size limits (since the size of a DVD was larger than many filesystems at the time could deal with).


DVD players are basic computers in that they run a virtual machine. DVD-Video discs with menus run bytecode on it, although most such code is pretty trivial if you consider the potential flexibility of the VM -- there are a few DVD games, playable on any DVD player.



Stream identifiers (FourCCs and others)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

When container formats can store more than one video codec, they want to be able to indicate the format (codec) used in each stream.

For example:

  • AVI uses FourCCs: a sequence of four bytes (usually four printable ASCII characters), also used by a few other formats
  • MPEG containers mostly just contain MPEG video (...but there are a bunch of details to that)
  • Matroska (mkv) uses another system, CodecID, a flexible-length string.
  • Ogg doesn't have an identifier system, instead asking all available codecs whether they can play the data given to them (initially just the first frame from a stream).
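
As a rough illustration of the FourCC idea (and of AVI's RIFF magic), a small Python sketch; the file name is hypothetical, and a real AVI parser would walk the RIFF chunk tree to find the stream headers rather than stop at the magic:

import struct

def fourcc_to_u32(code):
    # pack a four-character code into the little-endian uint32 it is stored as
    return struct.unpack('<I', code.encode('ascii'))[0]

def u32_to_fourcc(value):
    # unpack a uint32 back into its four ASCII characters
    return struct.pack('<I', value).decode('ascii', errors='replace')

def looks_like_avi(path):
    # minimal magic check: a RIFF container whose form type is 'AVI '
    with open(path, 'rb') as f:
        header = f.read(12)
    return len(header) == 12 and header[0:4] == b'RIFF' and header[8:12] == b'AVI '

print(hex(fourcc_to_u32('XVID')))   # 0x44495658
print(u32_to_fourcc(0x44495658))    # 'XVID'
# looks_like_avi('example.avi')     # hypothetical file name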


Video codecs

Earlier formats

  • Various RLE-like formats, used primarily for very simple animations


  • Flic (.fli, .flc), primarily video-only files used in Autodesk Animator
http://wiki.multimedia.cx/index.php?title=Flic_Video
https://en.wikipedia.org/wiki/FLIC_(file_format)


  • Cinepak
http://wiki.multimedia.cx/index.php?title=Cinepak


  • Intel Indeo:
    • Indeo 2 (FourCC: RT21) [10]
    • Indeo 3 (FourCC: IV31 for 3.1, IV32 for 3.2) [11]
    • Indeo 4 (FourCC: IV40, also IV41 for 4.1) [12]
    • Indeo 5.0 (FourCC: IV50) [13]


  • MJPEG is mostly just a sequence of JPEG images (FourCC: AVDJ, AVID, AVRn, dmb1, MJPG, mjpa, mjpb). [14] [15]
There are also some variations on this theme


Early H.26x family (related to MPEG and ITU standards. H.something is the ITU name):

  • H.261, a format made for videoconferencing over ISDN. Came before the more widely used H.263 [16]
  • H.262, which is identical to part of the MPEG-2 standard
  • H.263: for videoconferencing (seen used in H.323).
    • See also [17]
    • Also the base of various other codecs, including:
      • VIVO 1.0, 2.0, I263 and other h263(+) variants
      • Early RealVideo
      • Sorenson (including early Flash video)
        • Sorenson 1 (SVQ1, svq1, svqi)
        • Sorenson Spark (Used in Flash 6, 7 and later for video)
        • (Sorenson 3 (SVQ3) was apparently based on a H.264 draft instead)


H.261, H.262, and H.263 haven't really been relevant since the late nineties / early noughties, due to better things being available (for both realtime and non-realtime cases), such as MPEG-4, though the transition was more gradual than that. Consider e.g. Flash, RealVideo, WMV:

Nineties to noughties and later

  • MPEG-4 part 2, a.k.a. MPEG-4 ASP
    • DivX, XviD, and many versions, variants, and derivatives
    • FourCC: [18] mentions 3IV2, 3iv2, BLZ0, DIGI, DIV1, div1, DIVX, divx, DX50, dx50, DXGM, EM4A, EPHV, FMP4, fmp4, FVFW, HDX4, hdx4, M4CC, M4S2, m4s2, MP4S, mp4s, MP4V, mp4v, MVXM, RMP4, SEDG, SMP4, UMP4, WV1F, XVID, XviD, xvid, XVIX
(See also MPEG4)


  • H.264, a.k.a. MPEG-4 AVC, MPEG-4 Part 10
    • FourCC depends on the encoder (not too settled?).
      • ffmpeg/mencoder: FMP4 (which it also uses for MPEG-4 ASP, i.e. DivX and such. It seems this is mostly meant to send these files to ffdshow(verify), but not all players understand that)
      • Apple: avc1
      • Various: H264, h264 (verify)
      • Some: x264 (verify)


  • On2 (Duck and TrueMotion also refer to the same company):
VP3 (FourCC: VP30, VP31, VP32): [19]. Roughly in the same class as MPEG4 ASP. Open sourced.
VP4 (FourCC: VP40) [20]
VP5 (FourCC: VP50): [21] [22]
VP6 (FourCC: VP60, VP61, VP62): Used for some broadcasting [23] [24]
VP7 (FourCC: VP70, VP71, VP72): A competitor for MPEG-4 [25] [26]
Xiph's Theora codec is based on (and better than) On2's VP3 [27]


  • AV1
basically a successor to VP9, from the Alliance for Open Media
considerably more efficient than VP9 and H.264, and somewhat more efficient than H.265 (HEVC)
open, royalty-free (like VP9 and Theora), which makes it less cumbersome to adopt than H.264, H.265 license-wise


  • WebM
    • VP8 or VP9, plus Vorbis or Opus, in Matroska
started by Google after acquiring On2
supported by all modern browsers (like H.264)
open, also royalty-free (unlike some parts of MPEG4)
Quality is quite comparable to H.264


Dirac [28] is a new, royalty-free codec from the BBC, and is apparently comparable to H.264(verify).


  • H.265, a.k.a. HEVC
see also MPEG notes


  • H.266, a.k.a. VVC



Containers that meant different things over time

  • RealVideo uses different names internally and publicly, some of which are confusable:
RealVideo (FourCC RV10, RV13) (based on H.263)
RealVideo G2 (fourCC rv20) used in version 6 (and 7?) (based on H.263)
RealVideo 3 (FourCC rv30) used in version 8 (apparently based on a draft of H.264)
RealVideo 4 (FourCC RV40, and also UNDF) is the internal name/number for the codec used in version 9. Version 10 is the same format, but the encoder is a little better.
The H.263-based versions (up to and including 7) were not very impressive, but versions 9 and 10 are quite decent. All are proprietary and generally only play on RealPlayer itself, unless you use something like Real Alternative.


  • Flash video [29] (for each version: preferred codec first, then the rest of what it will play) (verify)
Flash 6: Sorenson Spark (based on H.263)
Flash 7: Sorenson Spark
Flash 8: VP6, Sorenson Spark [30]
Flash 9: H.264, VP6, Sorenson Spark (and understands MP4, M4V, M4A, 3GP and MOV containers)
Flash 10: (verify)


Microsoft:

  • Windows Media Video [31], often in .wmv files (which are asf containers)
    • version 7 (FourCC: WMV1) (based on MPEG-4 part 2)
    • version 8 (FourCC: WMV2)
    • version 9 (FourCC: WMV3)
  • RTVideo [32]
  • VC-1 [33]


Apple:

  • Quicktime [34]
    • 1: simple graphics, and RPZA video [35]
    • 5: Added Sorenson Video 3 (H.263 based)
    • 6: MPEG-2, MPEG-4 Part 2 support. Later versions also added Pixlet [36] [37]
    • 7: H.264/MPEG-4 AVC, better general MPEG-4 support
  • Internal formats like 'Intermediate Codec' [38] and ProRes [39]


Unsorted

  • Uncompressed Raw YUV [40]
  • Compressed YUV, e.g.
    • HuffYUV (lossless, and easily over 20GB/hour)
  • RawRGB (FourCC: 'raw ', sometimes 0x00000000) [41]
  • Hardware formats: (verify)
    • AVID
    • VCR2
    • ASV2


See also:

Pixel/color formats (and their relation to codecs)

Streaming, streaming support protocols

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

See Streaming audio and video

Subtitles

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Hardsubs is a jargon term for subtitles that are mastered directly into the video. They have no special status as far as the format is concerned, and simply mask the video underneath. This avoids all support issues, and usually looks good, but gives no choice of language, or of whether to display the subtitles at all.


Softsubs refer to separate subtitle data, historically often as a separate file with the same name and a different extension, and more recently as a part of container formats which support multiple streams (such as MKV), which can also store multiple different subtitles (e.g. languages) at once.


There are a number of formats, and not all file extensions are very obvious. Particularly things like .sub and .txt may be one of various formats.

See also Subtitle format notes



Player support

MPEG notes

Overview

MPEG refers to such a large sprawl of things that it's hard to have an overview of even just the interesting bits.

It refers to a wide set of standards, which includes various media containers, media codecs for audio, video, and more.


MPEG can refer to one of three sets of standards: MPEG-1, MPEG-2, and MPEG-4 (3 was skipped to avoid confusion with MP3, which refers (typically) to MPEG-1 audio layer III), formats that can store video and/or audio streams, and a little more.


Note that

  • MPEG-1 is ISO 11172 (the important bits mainly from 1993)
  • MPEG-2 is ISO 13818 (mainly from 1995..1997)
  • MPEG-4 is ISO 14496 (since 1999, parts being updated and added more steadily)
it's often shorter and easier to refer to, say, "MPEG-4 Part 3" rather than "ISO/IEC 14496-3"
People using "MPEG-4" are probably largely referring to the standard as a whole, the container format, or the ASP and AVC video codecs


  • There's a lot of parallel and adopted standards going on, e.g.
It's all ISO/IEC since MPEG-2 (fairly common on technical standards)
some parts are also basically identical to ITU standards (also common)
parts adopted from elsewhere (e.g. TwinVQ from NTT(verify)), sometimes in tight cooperation, sometimes more for the use of standardization (verify)
  • There are updates over many years
some more structural, like how the MP4 file format (Part 14, from 2003) builds on Part 12 and revises/replaces the earlier file format definition in Part 1 (verify)


MPEG Parts

(More widely known / used / interesting parts bolded)

MPEG-1 parts:

                                   last   
                         since    change    
Part 1   ISO 11172-1      1993     1999     Systems     (basically refers to the container structure and 
                                                               details like syncing and multiplexing)
Part 2   ISO 11172-2      1993     2006     Video, basically H.261
Part 3   ISO 11172-3      1993     1996     Audio             including what we know as MP3
Part 4   ISO 11172-4      1995     2007     Compliance testing     
Part 5   ISO 11172-5      1998     2007     Software simulation 


MPEG-2 parts

Part 1    ISO 13818-1     1996     2016     Systems, also H.222.0          
Part 2    ISO 13818-2     1996     2013     Video, basically H.262 (very similar to H.261 / MPEG-1 Part 2, 
                                                                          adds details like interlacing)
Part 3    ISO 13818-3     1995     1998     Audio, much like MPEG-1 Part 3, e.g. extending channels but in a 
                                                         backwards compatible way (hence MPEG-2 BC)
Part 4    ISO 13818-4     1998     2009     Conformance testing     
Part 5    ISO 13818-5     1997     2005     Software simulation     
Part 6    ISO 13818-6     1998     2001     Extensions for Digital Storage Media Command and Control (DSM-CC)
Part 7    ISO 13818-7     1997     2007     Advanced Audio Coding (AAC) (a.k.a. MPEG-2 NBC Audio, 
                                               non-backwards-compatible, to contrast with MP3/Part 3)
Part 9    ISO 13818-9     1996              Extension for real time interface for systems decoders     
Part 10   ISO 13818-10    1999              Conformance extensions for DSM-CC
Part 11   ISO 13818-11    2004              Intellectual Property Management and Protection (IPMP) 
                                               on the MPEG-2 system

Part 8 was 10-bit video but was never finished because of little interest.


MPEG-4 parts

Part 1    ISO 14496-1     1999     2014   Systems   including the MPEG-4 file format 
Part 2    ISO 14496-2     1999     2009   Visual    including Advanced Simple Profile (ASP), better known as DivX
Part 3    ISO 14496-3     1999     2017   Audio     including AAC, ALS (lossless), SLS, 
                                                    Structured Audio (low bitrate), HVXC and CELP (speech)
Part 4    ISO 14496-4     2000     2016   Conformance testing    
Part 5    ISO 14496-5     2000     2017   Reference software     
Part 6    ISO 14496-6     1999     2000   Delivery Multimedia Integration Framework (DMIF),
                                            basically an API that abstracts out network transfers
Part 7    ISO 14496-7     2002     2004   Optimized reference software for coding of audio-visual objects
Part 8    ISO 14496-8     2004            MPEG-4 content over IP networks, think RTP, SDP transport, some guidelines
Part 9    ISO 14496-9     2004     2009   Reference hardware description
Part 10   ISO 14496-10    2003     2016   Advanced Video Coding (AVC), a.k.a. ITU-T H.264
Part 11   ISO 14496-11    2005     2015   Scene description and (Java) application engine  (updates parts of Part 1)
Part 12   ISO 14496-12    2004     2017   ISO base media file format   
                                            (largely the same as ISO 15444-12, JPEG 2000's base format).
Part 13   ISO 14496-13    2004     2004   Intellectual Property Management and Protection (IPMP) Extensions     
Part 14   ISO 14496-14    2003     2010   MP4 file format, a.k.a. MPEG-4 file format version 2. 
                                            (Based on Part 12 and updates (clause 13 of) Part 1)
Part 15   ISO 14496-15    2004     2020   details of carrying Part 10 videos in a Part 12 style container (verify)
Part 16   ISO 14496-16    2004     2016   Animation Framework eXtension (AFX), describing 3D content
Part 17   ISO 14496-17    2006            Streamed, timed subtitle text
Part 18   ISO 14496-18    2004     2014   Font compression and streaming (for Part 22 fonts)
Part 19   ISO 14496-19    2004            Synthesized texture stream, for very low bitrate synthetic video clips.
Part 20   ISO 14496-20    2006     2010   basically a variant of Scalable Vector Graphics (SVG)
Part 21   ISO 14496-21    2006            a Java API for building multimedia apps
Part 22   ISO 14496-22    2007     2017   Open Font Format  (basically OpenType 1.4)
Part 23   ISO 14496-23    2008            Symbolic Music Representation (SMR)
Part 24   ISO 14496-24    2008            Audio and systems interaction 
                                            (details to putting MPEG-4 Audio in the MPEG-4 file format)
Part 25   ISO 14496-25    2009     2011   3D Graphics Compression 
Part 26   ISO 14496-26    2010     2016   Audio Conformance     
Part 27   ISO 14496-27    2009     2015   3D Graphics conformance     
Part 28   ISO 14496-28    2012            Composite font representation     
Part 29   ISO 14496-29    2014     2015   basically restricted profiles of Part 10 (H.264) video
Part 30   ISO 14496-30    2014            more subtitle related stuff (timed text and other visual overlays)
Part 31   ISO 14496-31    not finished    Video Coding for Browsers (VCB)
Part 32   ISO 14496-32    not finished    Conformance and reference software     
Part 33   ISO 14496-33    not finished    Internet video coding

A bit more real-world

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

"okay, beyond the standards, what parts of MPEG do I mostly care about?"


You might focus on the more popular codecs, containers, and products, e.g.

MP3 (MPEG-1, MPEG-2)
AAC (defined in MPEG-4 Part 3, among lesser known things)
MPEG-4 ASP (better known as DivX, XVid)
MPEG-4 AVC (better known as H.264)
The MPEG-4 container format involves multiple parts, in that
Part 1 is effectively v1
Part 14 updates Part 1 and builds on Part 12 (retroactively versioning Part 1's format as v1 and Part 14's as v2), and is almost identical to the QuickTime File Format
Part 12 is the ISO base media file format that both build on


Physical disc formats often follow a restricted version of a specific MPEG variant. Consider:

based on MPEG-1 include VCD, XVCD and KVCD
based on MPEG-2 include DVD, SVCD and KVCD
DVDs follow MPEG-2 Program Stream format(verify), with specific restraints on the codec and (effectively) on encoder settings.
BluRay is mostly MPEG4 (and a few MPEG2)
EVO (EVOB) [42] on HD-DVD allows [43] MPEG-2 video or MPEG-4 AVC


You can often get away with not caring about lesser-used / niche things, like

TwinVQ
ALS
various whole Parts


Note that

  • As a name, MPEG-2 can be vague, in that it can refer to
the whole standard (ISO 13818),
or specifically to MPEG-2's video compression (MPEG-2 Part 2, a.k.a. H.262 -- itself basically an extension of MPEG-1 Part 2).
  • As a name, "MPEG-4 video" can be vague, in that it can refer to
    • MPEG-4 Part 2, which includes MPEG-4 ASP, better known as DivX, XviD, and some others
    • MPEG-4 Part 10, MPEG-4 AVC, which is known as H.264 instead (its ITU name) to lessen confusion.


Also relevant:

  • MPEG-H is ISO/IEC 23008, and includes H.265 a.k.a. HEVC (its Part 2)
which can be seen as a further development of MPEG-4 AVC / H.264 that gives similar quality at half the bitrate
and it can be carried in MPEG TS, MPEG-4 containers (and others)
  • MPEG-5 is ISO/IEC 23094 (since 2020)
which is mostly specific video codecs: Essential Video Coding (EVC, Part 1) and LCEVC (Part 2)
which say they give quality similar to H.264 at around 60% of the bitrate
  • ...and once we're on such comparisons, also VC-1, VP8 and VP9, and such






MPEG audio codecs

MPEG1(/2/2.3) Audio Layers

An MP3 file is a raw audio elementary stream (not wrapped in PS or TS), without a file-level header. (though I've seen a few ADTS streams with .mp3 extension)

so detecting something as an MP3 becomes more of an "if you can detect a few valid, consistent packets back to back" thing
https://en.wikipedia.org/wiki/MPEG_elementary_stream#General_layout_of_MPEG-1_audio_elementary_stream
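
As a rough sketch of that back-to-back detection idea (simplified: it only checks the sync bits and a few header fields for validity, and skips the frame-length step a real detector would use to verify that consecutive frames actually line up):

def looks_like_mpeg_audio_header(b):
    # very rough validity check on a 4-byte MPEG audio (e.g. MP3) frame header
    if len(b) < 4 or b[0] != 0xFF or (b[1] & 0xE0) != 0xE0:
        return False                    # the 11-bit frame sync is not there
    version = (b[1] >> 3) & 0x03        # 01 is reserved
    layer   = (b[1] >> 1) & 0x03        # 00 is reserved
    bitrate = (b[2] >> 4) & 0x0F        # 1111 is invalid
    srate   = (b[2] >> 2) & 0x03        # 11 is reserved
    return version != 1 and layer != 0 and bitrate != 0x0F and srate != 3

def candidate_frame_offsets(data, limit=5):
    # scan for offsets whose next 4 bytes pass the header check; a real detector
    # would also compute each frame's length and require several consecutive
    # frames to line up, which is omitted here
    hits = []
    for i in range(len(data) - 4):
        if looks_like_mpeg_audio_header(data[i:i + 4]):
            hits.append(i)
            if len(hits) >= limit:
                break
    return hits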

The audio stream itself is the one mainly defined in MPEG-1 Part 3. For music we almost exclusively use Layer III (which MP3 is named for), as it compresses more efficiently than Layers I and II at the bitrates and sample rates we care about.

MPEG-2 (BC) audio added a few lower bitrates and sample rates, making it slightly more flexible, but in ways that are only rarely useful (e.g. not for music).

It is not 100% backwards compatible, but the change to decoders is so small that, to most players, MPEG-1 and MPEG-2 audio streams are almost identical in practice.

There is also an unofficial "MPEG 2.5", which extends the options on the lower-bitrate end. This is another step of 'probably supported, but not technically backwards compatible'. It is also not in the standard.

MPEG-2 (BC) audio also adds 5.1 channels, but designed around the MPEG-1 core, to the point that an MPEG-1 decoder will play the stereo channels and ignore the rest(verify)



AAC

Effectively has a few versions:

MPEG-2 Part 7 defines three profiles
AAC-LC / LC-AAC (Low Complexity)
AAC Main
AAC-SSR (Scalable Sampling Rate) [44]
MPEG-4 Part 3 adopts and extends MPEG-2's AAC
minor changes and more profiles
AAC-LD (Low Delay)
Later added HE-AAC (sometimes aacPlus), a higher-efficiency variant
Later yet HE-AAC v2, which added parametric stereo

See also MPEG-4 Part 3 Audio Profiles

These were essentially all expansions of the previous, so you can have one decoder for all.


MPEG-4 DST

Direct Stream Transfer: lossless compression of DSD bitstreams (as used on Super Audio CD)


MPEG-4 HVXC

Speech
https://en.wikipedia.org/wiki/Harmonic_Vector_Excitation_Coding


MPEG-4 ALS

Audio Lossless Coding
https://en.wikipedia.org/wiki/Audio_Lossless_Coding


MPEG-4 SLS

https://en.wikipedia.org/wiki/MPEG-4_SLS


MPEG-4 Structured Audio

https://en.wikipedia.org/wiki/MPEG-4_Structured_Audio



Transports


ADTS (Audio Data Transport Stream), from MPEG-2 Part 7, wraps each AAC frame with its own small header; despite the name it is its own simple self-framing stream, not an MPEG-2 Transport Stream.


.aac files are often either raw AAC, or AAC in ADTS
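
As a sketch of what ADTS framing looks like in practice, a small parser for the fixed part of the header (the sample-rate lookup table and CRC handling are left out, so treat it as illustrative only):

def parse_adts_header(b):
    # pull a few fields out of the fixed 7-byte ADTS header; returns None if
    # the 12-bit sync word (0xFFF) isn't there
    if len(b) < 7 or b[0] != 0xFF or (b[1] & 0xF0) != 0xF0:
        return None
    return {
        'crc_present':       (b[1] & 0x01) == 0,   # 0 = a 2-byte CRC follows the header
        'profile':           (b[2] >> 6) & 0x03,   # 0=Main, 1=LC, 2=SSR
        'sample_rate_index': (b[2] >> 2) & 0x0F,   # index into the standard sample-rate table
        'channel_config':    ((b[2] & 0x01) << 2) | (b[3] >> 6),
        'frame_length':      ((b[3] & 0x03) << 11) | (b[4] << 3) | (b[5] >> 5),  # includes the header itself
    }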


LOAS (Low Overhead Audio Stream) and LATM (Low-overhead MPEG-4 Audio Transport Multiplex) are simplified MPEG-4 transports for just audio.

It frequently carries AAC, but can also carry other MPEG-4 codecs.



(Carrying other things, like AC-3?)

MPEG video codecs

MPEG-4 ASP (e.g. DivX) was a good jump forward at the time, a few years later MPEG-4 AVC (a.k.a. H.264) was another.


MPEG-2 sounds old, and at lower bitrates indeed looks much worse than AVC (or ASP), so when pressed for space, or when playing video online (where transfer costs per unit of quality are higher), you would always choose a more space-efficient codec - including some newer than AVC when support is good enough. (Which, because of stupid political games, is easily summarized as "not".)


At high enough bitrates, most any video codec becomes transparent.

And it matters that more refined encoders did a lot better years later, while still adhering to the same bitstream format.(verify)


As such, when space is not an issue, MPEG-2 is still perfectly serviceable, and it can matter that MPEG-2 may encode and decode faster because it's simpler.

But that only applies to a few uses, like streaming where CPU cost is higher than transmission cost (which is now basically never the case, though there are still a few potential uses, like "in my event streaming setup").

The case I'm getting at is Blu-Ray. DVDs and Blu-Rays were both basically sized to fit high-bitrate MPEG-2 movies (of different resolutions, because of different eras)(verify).

Given that Blu-Rays are large and a disc holds a single movie, you can throw enough bitrate at it that MPEG-2 is going to be just as transparent as AVC or VC-1 (and any differences are more down to codec-specific artifacts, or probably more to things like film transfers).

Which is why it actually makes a lot of sense to support MPEG-2 in Blu-Ray discs.

At the same time, while both may be equally fine for video, most Blu-Rays are AVC[45], probably for more practical reasons.

Container format notes

MPEG 1, 2, and 4 (MPEG in general) define a number of stream types, which work out to be fairly general-purpose.

There are restricted forms in various places, often to guarantee playability on hardware players.


MPEG 1 and 2

Elementary Streams, Transport Streams, Program Streams:

An (MPEG) Elementary Stream is a single type of data, e.g. audio, or video.

They will typically be the output of a single audio or video encoder. They are streams in that any one stream can be given to the corresponding decoder.

It may or may not have a header explaining exactly what format the data is in.

And neither encoder nor decoder needs to think about how this is stored or multiplexed or such - that's up to the container.


A Packetized Elementary Stream (PES) is such an elementary stream split into smallish packets, largely so that you can guarantee you can go through it using a smallish buffer.

The structure of an elementary stream depends on the codec being used, and even the presence of a header at the start is effectively optional, yet common for practical reasons.

For example:

  • An MPEG-2 video elementary stream will start with a header
e.g. 00 00 01 B3
https://en.wikipedia.org/wiki/MPEG_elementary_stream#Header_for_MPEG-2_video_elementary_stream
  • An MP3 file is a single audio elementary stream (not wrapped in PS or TS), and without even a header
It isn't strictly a PES either(verify), but if your goal is using a small buffer, the way it splits up into frames amounts to roughly the same thing


When playing back and/or transmitting streams (video, audio, captions, other), you often need to play them back synchronised.

MPEG2 makes a distinction between

Transport Streams (TS)
fixed bitrate, which makes them easier to buffer, easier to resync in less-reliable contexts, and puts more restrained demands on the decoder, which is nice for hardware playback
at the cost of some coding efficiency
has some more error correction and synchronization pattern?(verify)
Program Streams (PS)
variable bitrate, can be more space-efficient
at the cost of buffering being a little more complex, and less resistance to unreliable media, potentially higher demands on the decoder.


Things like DVD, and most MPEG-2 video files you'll find, will be PS.

You'll find TS around broadcast, e.g. used in DVB.


Both can carry multiple PES streams, but in PS they are part of the same program, and TS can carry multiple programs (is this used, though?)(verify)


Syncing is a topic of its own.

In video production (e.g. studios) you would often use genlocking, in which the video signal itself is used to sync another.

In transmission this is less practical and less flexible, and you often set up a clock source by description, and sync separate streams to that via their timestamps. This is what a Program Clock Reference (PCR) is about.

In a Single Program Transport Stream (SPTS) there is a single PCR channel.

In general, different programs may have to be synchronized differently.



In PS the timebase can be known ahead of time, and things fetched based on that timebase, and streams are implicitly on the same clock, because of how.

In TS


While a distinction could be made between MPEG-1 PS and MPEG-2 PS (being defined in MPEG-1 Part 1 and MPEG-2 Part 1, respectively), they seem to be so similar that a lot of code reads both(verify) (though what they contain varies).

Transport Streams seem to have been introduced in MPEG-2(verify) defined as part of MPEG-2 Part 1, so TS implicitly refers to MPEG-2.

Also, the distinction seems to be mostly an MPEG-2 thing, with MPEG-4 presumably being more fine grained?(verify)


MPEG-2 Transport Stream (TS)'s Transport Packets (TP) have a fixed size of 188 bytes, so you can be fairly sure you didn't lose sync if you skip 188 bytes and see another sync byte (0x47).
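
A quick sketch of that resync idea: look for an offset where 0x47 keeps appearing at 188-byte strides (a real demuxer would also handle 192- and 204-byte packet variants, and would go on to parse PIDs and payloads):

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

def find_ts_alignment(data, packets_to_check=5):
    # return the first offset at which several consecutive 188-byte packets
    # all start with the 0x47 sync byte, or None if no such grid is found
    needed = packets_to_check * TS_PACKET_SIZE
    for start in range(TS_PACKET_SIZE):
        if start + needed > len(data):
            break
        if all(data[start + i * TS_PACKET_SIZE] == SYNC_BYTE
               for i in range(packets_to_check)):
            return start
    return None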

In a Single Program Transport Stream (SPTS), there will be one Program Clock Reference (PCR) channel that recreates one program clock for both audio and video.


MPEG-2 PS



In MP3 you are typically dealing with MPEG-1, sometimes MPEG-2; frames have sync bits and a header that let you be pretty sure this one is correct, and that the thing that follows is also a frame.

http://www.img.lx.it.pt/~fp/cav/Additional_material/MPEG2_overview.pdf

MPEG-4 container

Uses a TLV style affair, with

a uint32 size (including the size and type, and contained boxes)
a uint32 type (often readable ASCII)
fields (chunksize minus 8 bytes of them)
possible contained boxes

MPEG-4 calls them Boxes (previously atoms), which helps visualize how they can be nested into a tree structure.


ftyp

  • mandatory (but if missing, should be parsed as if there is an ftyp with mp41 brand)
  • major brand - uint32 (usually printable text) that lists the type of content
  • minor version - uint32, minor type to the brand
  • compatible brands -
  • An MPEG-4 stream puts an ftyp box as early as possible, and in a file it is typically at the start, which is also useful as file magic (look for 'ftyp' at index 4; the size of this box varies a little due to the compatible-brand list). The four bytes after ftyp help identify the more specific kind of file it is. For example:
isom MP4 Base Media v1 (Part 12)
mp71 MP4 with MPEG-7 metadata
mp41 MP4 v1 (Part 1)
mp42 MP4 v2 (Part 14)
qt   Apple QuickTime
mmp4 3GPP ('mp' referring to Mobile Profile) (there are a handful more 3GPP variants)
There are various more, see also ftyps.com


As an indication of nesting, a relatively minimal video file's first two levels may look something like

  • ftyp
  • moov
    • trak (details about the video track)
    • trak (details about the audio track)
  • mdat (no boxes; contents are referred to via sample tables under trak)


https://standards.iso.org/ittf/PubliclyAvailableStandards/c068960_ISO_IEC_14496-12_2015.zip
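
A minimal sketch of walking that size/type structure at the top level and picking out the ftyp major brand; it only minimally handles the 64-bit and run-to-end size cases and does not recurse into container boxes, so it is meant to show the shape of the format rather than be a real parser:

import struct

def iter_boxes(data, offset=0, end=None):
    # yield (box_type, box_start, box_size) for the boxes in data[offset:end]
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, = struct.unpack_from('>I', data, offset)     # 32-bit big-endian size
        box_type = data[offset + 4:offset + 8].decode('latin-1')
        header = 8
        if size == 1:                                      # 64-bit 'largesize' follows
            size, = struct.unpack_from('>Q', data, offset + 8)
            header = 16
        elif size == 0:                                    # box runs to the end of the enclosing scope
            size = end - offset
        yield box_type, offset, size
        offset += max(size, header)                        # guard against a zero advance

def ftyp_major_brand(data):
    # return the ftyp major brand, if an ftyp box is found at the top level
    for box_type, start, size in iter_boxes(data):
        if box_type == 'ftyp' and size >= 12:
            return data[start + 8:start + 12].decode('latin-1')
    return None

# usage, on some hypothetical file:
# data = open('example.mp4', 'rb').read()
# for box_type, start, size in iter_boxes(data):
#     print(box_type, size)
# print(ftyp_major_brand(data))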

3GPP

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

3GPP (3rd Generation Partnership Project) is a wide term, grouping various mobile-related development.

In this context we perhaps care most about the MPEG-4 style container as used in mobile context (often referred to as 3GP and 3G2), which basically takes the MPEG-4 container format and removes some features you don't need there (making it easier to implement), and adds some things useful in this context.

https://en.wikipedia.org/wiki/3GP_and_3G2

MPEG streams can contain...

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

In practice it's more relevant what a stream can contain, and where you can use it (than how it's structured).

An MPEG-1 Program Stream can contain (presumably(verify))

  • MPEG-1 Part 2 video
  • MPEG-1 Part 3 audio

An MPEG-2 Program Stream can contain

  • MPEG-1 Part 2 video
  • MPEG-2 Part 2 video
  • MPEG-1 Part 3 audio
  • MPEG-2 Part 3 audio

And in theory (but rarely in practice)

  • MPEG-4 Part 2 video (also in TS)
  • MPEG-2 Part 7 audio (AAC)
  • MPEG-4 Part 3 audio (AAC)


MPEG-4 can contain

  • As per the ISO 14496 standard:
    • video: ASP (MPEG-4 Part 2)
    • video: AVC a.k.a. H.264 (MPEG-4 Part 10)
    • audio: MP3 (MPEG-4 Part 3) (verify)
    • audio: AAC (MPEG-4 Part 3)
    • audio: ALS (lossless) (MPEG-4 Part 3)
    • audio: SLS (MPEG-4 Part 3)
    • audio: Structured Audio (MPEG-4 Part 3) (low bitrate)
    • audio: HVXC (MPEG-4 Part 3) (speech codec)
    • audio: CELP (MPEG-4 Part 3) (speech codec)
    • audio: TwinVQ (MPEG-4 Part 3)


  • Other standards, or proprietary, or haven't figured out yet
    • 3GPP (.3gp) and 3GPP2 (.3g2) are restricted versions of the MPEG-4 container and/or contents, made to be more easily supported by mobile devices, and can contain
      • video: H.263
      • video: VP8
      • audio: AMR-NB
      • audio: AMR-WB and WB+
      • audio: AAC (AAC-LC, HE-AAC v1, HE-AAC v2)
      • audio: MP3
    • video: VC-1 (mostly just in Blu-Ray)
    • video: HEVC a.k.a. H.265

Other transports

Online streaming is frequently MPEG-DASH[46] ('Dynamic Adaptive Streaming over HTTP'). (Meant as a standard that is less proprietary than Smooth Streaming (Microsoft), HDS (Adobe), or most others.)


DASH breaks content into short segments, where each segment (the minimum download unit, usually a few seconds) can be served at different bitrates.

A manifest is transferred to tell the player where to find segments for each quality.

The client can choose the bitrate it thinks works best (balance between required speed and detail), and can typically switch seamlessly during playback.


You can have DASH in downloaded form, which will presumably be a sequence of moof,mdat fragments (rather than a typical MP4, which can be mostly just one big moov,mdat pair) (presumably it's also more specifically following 3GPP)



https://en.wikipedia.org/wiki/Dynamic_Adaptive_Streaming_over_HTTP

Frame rate, analog TV format, and related

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(I'm not sure about all this - there is so much fuzzy and/or conflicting information out there)


Frame rates

Movies on media like DVD come in different frame rates. This does not matter to computer playability, so unless you are compressing or converting video, you probably want to ignore this completely.

Common rates

Some of the more common rates seem to be:

  • 24 (exactly): used to shoot most film, and used in most cinema projection. Also referred to as 'film'.
  • 24000/1001 (approx. 23.976, also written 23.97): usually an intermediate in conversion from film to NTSC color. Also referred to as 'NTSC film'.
  • 25 (exactly): speed of rasters transferred (not shown frames) in broadcasts such as PAL (except PAL M) and SECAM. Also referred to as 'PAL video'.
  • 30000/1001 (approx. 29.97): the speed of rasters transferred (not shown frames) in (interlaced) broadcasts such as NTSC M (the most common NTSC form) and also PAL M. (Pre-1953 NTSC broadcasts were exactly 30.0fps.) Also referred to as 'NTSC video'.
  • 30 (exactly): apparently the black-and-white variant of NTSC was exactly 30, and 30000/1001 was the hack upon that (verify). Exactly-30fps content is relatively rare(verify), because it's either pre-1953 NTSC TV, or modern digital things that just chose this(verify).
  • 50 (exactly): can refer to 50 frames per second progressive, or 25 frames per second interlaced that is being played (and possibly deinterlaced) as its 50 contained fields per second (as e.g. in PAL and SECAM TV, except PAL M). Also referred to as 'PAL film', 'PAL video', or 'PAL field rate'.
  • 60000/1001 (verify): the field rate of NTSC color; can refer to NTSC color TV that is transferring interlaced rasters. Also referred to as 'NTSC field rate'.



These are the most common, but other rates than these exist. For example, there is double rate and quad rate NTSC and PAL (~60fps, ~120fps; 50fps, 100fps), often used for editing, or e.g. as intermediates when converting interlaced material.


A framerate hints at the source of the video (24 is typically film, ~25 is often PAL broadcast, 30000/1001 is typically NTSC broadcast) and/or the way it is played (e.g. 50 and 60000/1001 usually means analog TV content, and possibly interlaced). There's a bunch of cases where you can't be sure, because there are some common conversions, e.g. 24fps film converted to 25fps and 29.97fps for broadcast and DVDs. And be careful of assumptions about interlaced/progressive/telecined. Note also that DVDs can mix these.


Movies still mostly use 24fps, primarily because we're used to it. It looks calmer, and we associate 24fps with higher quality, partly because historically higher framerates remind us of home video and its associated cheapness (and of technical uses like interlaced sports broadcasts).

(Of course, these associations are also entangled with camerawork and other aspects, so it's not quite that simple. It's partly silly, because TV broadcast of 24fps material necessarily involved some less-than-perfect conversions)


On 1001 and approximations

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

NTSC made things interesting.

NTSC existed since 1941 was then exactly 30fps, had no color, and fit in 4.5MHz of usable radio frequency.

In 1953 it was replaced by what we now call NTSC color. Of course, the only TV at the time was black and white, and people wanted it to be backwards compatible with them, so that one transmission could serve both.

For this reason,

  • it was designed to send separate luminance (black and white) roughly as before,
and separate chrominance information.
  • it was designed to use the same frequency band
so chrominance would actually have to overlap the luminance.
Minimizing the interference between the two, with the method they ended up choosing, meant some math that (long story short) required that the number of columns, the number of rows, or the number of frames per second had to be shifted a little.

They chose to settle columns and rows to 525 and 286, and fudge frames per second, meaning that that number is

4500000Hz / (525*286)

which, if you remember your fractions from school, happens to simplify to

30000/1001

Which is approximately

29.970029970

So, NTSC color broadcast is never 30fps, it's always 30000/1001.
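
You can sanity-check that simplification directly, e.g.:

from fractions import Fraction

rate = Fraction(4500000, 525 * 286)
print(rate)         # 30000/1001
print(float(rate))  # 29.97002997002997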


As an aside, 29.97 is an inaccurate approximation of 30000/1001, but it's off by so little that getting this wrong only starts racking up a noticeable change after an hour or two.

Compared to what, though? Does it matter at all?

It can matter when you are processing content where NTSC is involved somehow, such as going between PAL and NTSC hardware/broadcast, so it matters to professional transcoding. For typical-youtube-length videos you probably wouldn't even notice if this conversion was done incorrectly.

(Similarly, 23.976 for 24000/1001 (which happens in film-to-NTSC conversion) is also slightly off; 23.97 and 23.98 more so.) (verify)

(Since the same trick wasn't necessary for PAL, PAL is always 25fps precisely.)

(Also, film to PAL is often just played that ~4% faster.)


In most cases, the difference between the fraction and its approximation is tiny, but will be noticeable over time, in timecode, and probably audio sync, depending a little on how the audio is synced.


You can fix NTSC's timecode issue with Drop-Frame (DF) timecode.

Timecode counts in whole frames, so from its perspective the difference between 29.97fps instead of 30fps is ~0.02997 missing frames per second, which is roughly (30*60*60 - 29.97*60*60=) 108 frames per hour.

Drop Frame Timecode (DF) skips two frame numbers from just its own count (no actual frames are dropped), in nine out of every ten minutes. This happens to work out as exactly (2 frames * (6*9) applicable minutes =) 108 per hour. (This technically still makes a rounding error, but it happens to be precise enough for most anything we do)
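
As a sketch of how that counting rule is usually implemented (the constants follow the standard 29.97 drop-frame convention; this only labels frames, it does not touch the video itself):

def drop_frame_timecode(frame_number):
    # label an actual frame count with 29.97 drop-frame timecode: label numbers
    # ;00 and ;01 are skipped at the start of every minute not divisible by 10
    frames_per_10min = 17982            # 10*60*30 - 9*2
    frames_per_min = 1798               # 60*30 - 2
    d, m = divmod(frame_number, frames_per_10min)
    if m < 2:
        adjusted = frame_number + 18 * d
    else:
        adjusted = frame_number + 18 * d + 2 * ((m - 2) // frames_per_min)
    ff = adjusted % 30
    ss = (adjusted // 30) % 60
    mm = (adjusted // 1800) % 60
    hh = (adjusted // 108000) % 24
    return '%02d:%02d:%02d;%02d' % (hh, mm, ss, ff)

print(drop_frame_timecode(1800))    # 00:01:00;02  (labels ;00 and ;01 were skipped)
print(drop_frame_timecode(17982))   # 00:10:00;00  (minute 10 does not skip)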


Non-Drop-Frame (NDF) timecode simply counts every frame without skipping any numbers - which is fine for content that doesn't have this 1001 NTSC bother in the first place (and on 29.97 content it will slowly drift from wall-clock time).



Common constraints

When video is made for (analog) broadcast, it is very much constrained by that standard's color and frame coding, and more significantly in this context, its framerate.


When video is made for film, it is limited by projector ability, which has historically mostly been 24fps.

When it is made for NTSC or PAL broadcast, it is usually 30000/1001 or 25, respectively.


Computers will usually play anything, since they are not tied to a specific rate. Even though monitors have a refresh rate, almost all common video will be slower than it, meaning you'll see all frames anyway. (Computer playback is often synchronized to the audio rate.)

Common conversions

Conversion from 24fps film to 25fps PAL broadcast rate can be done by playing the frames and audio faster by a factor 25/24, either letting the pitch be higher (as will happen in analog systems) or the same pitch by using digital filters (optional in digital conversion).

Few people will notice the ~4% difference in speed, pitch or video length.

This does not work for NTSC as the difference is ~25%. Instead, telecining is often used, though it also makes film on TV a little jumpier for the 60Hz/30fps countries (including the US).
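
To put some numbers on the PAL speed-up mentioned above (the two-hour runtime is just an example figure):

import math

speedup = 25 / 24                          # PAL speed-up factor, about 1.042
film_minutes = 120                         # a hypothetical two-hour film
pal_minutes = film_minutes / speedup       # about 115.2 minutes when broadcast
pitch_shift = 12 * math.log2(speedup)      # about +0.71 semitones, if pitch is not corrected
print('%.1f min, %+.2f semitones' % (pal_minutes, pitch_shift))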


Interlacing, telecining and such

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

tl;dr:

  • telecining is mostly relevant when you work with video from analog TV broadcast - if you work in video you should be seeing less and less
  • interlacing also comes from analog TV, but lives on longer (e.g. modern camcorders may still use interlacing to save a little space) - so this is still something you may deal with


Progressive

Progressive means that each frame is drawn fully on screen, and that frames are drawn in simple sequence.


Seems very obvious. Best understood in contrast with interlacing and telecining (pixel packing can also matter somewhat).


Constraints in aether bandwidth and framerates are the main reasons for interlacing and telecining. Computers are not as constrained in those two aspects as broadcasting standards and CRT displays are, and as such, other factors (such as the compression codecs) tend to control the nature of video that is digitally available. This is one reason that much of it is progressive.


Interlacing

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Interlacing comes from a time where TV broadcast was already a thing, and engineers worked to squeeze more video out of already-fixed allocation of bandwidth.


It is also relevant that this was in the CRT days, because interlacing means one refresh updating alternating lines, and the next one the other set of alternating lines.

It considers that the phosphors we're lighting up seem, to the human visual system, persistent enough (and that flickering details are more acceptable to us than flickering areas), so this happens to be one of the least-noticeable ways of updating the screen at what seems like twice the speed and - not unimportantly - TVs that didn't know about interlacing at all would still do an entirely sensible thing too. So it was genuinely clever, given the constraints of the time.


This meant that you could e.g. keep sending the same 25 frames per second. If treated as 25 frames to show, this would look okay; if treated as 50 half-the-information fields, it would look faster.

Once TVs were capable of both, the broadcaster had the option of doing either.

Conceptually, they could now choose to:

  • send a source that is 25fps
both fields come from the same point in time of the original video,
essentially building up each full frame from two halves
images get displayed slower, but the detail looks more coherent
and would e.g. make sense when the source is 24fps film
  • send a source that is 50fps
the two fields in a frame now come from different points in time
both fields come from different times and update at different times
which looks faster, so great for things like sports broadcasts
but has issues, e.g. showing details during fast movement



Why do it, and why not?

If your bandwidth / storage has specific constraints, and you prefer showing motion rather than detail, you would probably want interlacing.

Which was absolutely true for analog TV - and much less so in digital video (which has been common for decades), or digital TV (which is a more recent switch).


That said, there is something to be said for interlacing when constraints aren't as strict.

It's a simple compression-like scheme, that works well enough for video content that shows mostly large details and slow movement, in which case deinterlacing can reconstruct something very close to the original.

Advanced de-interlacing algorithms, supported by newer and faster hardware (and making assumptions that are true for most video), can bring quality back to levels that are surprisingly near the original, for most but not all content.


One good reason against interlacing is that it is inherently lossy, so maybe it should only be a broadcast format, not a storage format.


Another is that when digital video compression is involved, it makes things messier - codecs tend to not deal well with interlaced video, and particularly badly if you didn't tell them it is interlaced.


While with SDTV broadcast, interlacing was just the way it worked for archaic reasons, with HDTV it's optional - and a specific choice.

HDTV does see interlaced programming, and HD camcorders may still choose to do it, mostly for the same reasons as before - saving bandwidth (but now in terms of network use and storage, instead of RF).

Yet it's also less significant, because while it does save space/bandwidth, it's not half, and there's now always compression anyway.

Also, when your job involves editing and/or compressing video, interlacing means extra steps, extra processing, and extra chance of specific kinds of artefacts.


Some more gritty

Note that depending on how (and where in the processing chain) you look at the content, you could refer to interlaced content's rate either by the transferred rasters or the transferred fields - the transferred framerate or the shown framerate (30000/1001 or 60000/1001 for NTSC, 25 or 50 for PAL). This leads to some confusion.

From this perspective of displaying things, each image (raster) that is received/read contains two fields (or frames - not everyone is consistent with the terms - and neither am I), shot at slightly different times, and updated on the display at different times.


This is not the only relevant view. Things become a little funnier around digital video. Consider that NTSC is 30 rasters per second regardless of what it contains, and it's simpler for digital video to just store those rasters.

Digital video will often not see interlaced video as having a doubled framerate. If you just show that as-is, there is no longer a difference between rasters and frames, and you will show two (chronologically adjacent) half-updates at the same time.

Which looks worse than it did on TV because, while you're showing the same content, the interlacing lines are more apparent on a digital display than on a TV's phosphors.

For this reason, de-interlacing is often a good idea, which actually refers to a few different processes that make the video look better.


Interlacing in general, and in analog systems in particular, comes at the cost of display artifacts under particular conditions: while interlacing is usually not particularly visible on TVs, specific types of resulting problems are brought out by things such as fast pans (particularly horizontal movement), sharp contrast and fine detail (such as small text with serifs, and computer output to TV in general), and shirts with small stripes (which can be lessened with some pre-broadcast processing).


Interlacing is also one of a few reasons that a TV recording of a movie will look a little less detailed than the same thing on a (progressive) DVD, even when/though DVDs use the same resolution as TVs (other reasons for the difference are that TV is analog, and involves various lossy steps coming to your TV from, eventually, the original film material).


For notes on telecined content, see below. For now, note also that interlacing is a technique applied at the playing end (involves taking an image and playing it as two frames), while telecining just generates extra frames initially, which play progressively (or have inverse telecine applied to it; see details).



Note that when a storage format and player are aware of interlacing, it can be used smartly again. For example, DVDs may mix progressive, telecined, and interlaced behaviour. The content is marked, and the player will display it accordingly. Interlaced DVD content is stored in the one-image-is-two-frames way, and the player will display it on the TV in the double-framerate-half-updates way described above.


Deinterlacing
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Deinterlacing takes interlaced material and produces a progressive-scan result.

Often applied to make interlacing's visual artifacts, particularly jagged edges (a.k.a. sawtooth edge distortion, mice teeth, combing, serrations), less noticeable, whether for display or for (re)compression (as lossy video compression that isn't counting on interlaced video deals with it quite badly).

Note that in some cases deinterlacing reconstructs the original. In most cases it is an inherently lossy process, in that it throws away data and isn't exactly reversible - but may be worth it perceptually.


There are also smarter and dumber ways of doing deinterlacing (most of those detailed below are the simple and relatively dumb variants), and the best choice depends on

  • whether you are doing it for display or storage/(re)compression
  • whether you are displaying on a CRT (phosphor and scanline) or something else with output restrictions, or a digital display (which tends to have few restrictions)
  • the nature of the video before interlacing (are the two fields captured at different times or not?)

You may like to know that:

  • Analog TV has to adhere to broadcast standards from the 1930s and is interlaced
    • ...but whether the two fields in a frame are from different times varies
    • in PAL countries most films are not shown interlaced - the two fields come from the same film frame (the 25/24 discrepancy is fixed by speeding up the film that much)
    • in NTSC countries, film is likely to be telecined (similar to interlacing)
    • sports and such is taken at 50/60fps and sent as that many fields (so shown as that many interlaced half-frames), as it looks smoother.
  • different types of camcorders may store half-height interlaced video, or may store progressive frames (depends on goal and quality).
  • TV capture cards usually take the TV signal as-is, so tend to return still-interlaced video (verify)


For the examples below, consider two video sources transmitted through interlaced broadcast:

  • video A: a film at 25 frames per second. The two fields complement each other to give the exact original frame
  • video B: sports footage, or video from an interlacing camcorder. Each of the 25 frames per second has its two fields taken from video 1/50th of a second apart.

The text below will mix 'field per second' and 'frame per second' (and sometimes fps for progressive, where there is no difference), so pay attention and correct me if I get it wrong :)


Weave

Weaving creates progressive frames by showing both fields at the same time, still interlaced.

Yes, this is a fancy name for 'doing nothing', other than copying each line from the alternate frame.

It's simple to do, and it's fast. Technically it retains all the video data, but in a way that doesn't look good in various practical cases.


Weaving video A means we reconstruct the video at its original resolution and frame rate, and can show it as such digitally.

Weaving video B means we construct a 25fps video with jagged lines (when there is fast movement). On a digital display this tends to be more noticeable than on CRTs, so for some purposes you would be better off with any deinterlacing that tries for 50fps output. That goes in particular for video compression, as most codecs don't deal well with such one-pixel details, so this will typically give lower-quality encodes (for same-sized output).


Digital capture from interlaced media (e.g. a TV capture card) will often capture in a way that effectively weaves frames, which is why you get the jagged-movement effect if the video is never deinterlaced.

Discard

When displaying video on a device with too little processing power, an easier-to-code and faster method is discarding every second field (a.k.a. single field mode) and drawing lines from the other twice (line doubling), to get output that has the same size and same framerate as the original.

On video A we throw away half the vertical detail (and produce 25fps video).

On video B we throw away half the vertical detail and half the temporal information as well (and produce 25fps video).

Compare with bob. (basically, discard is half a bob)
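A sketch of discard in the same illustrative style (one field is dropped, the other line-doubled):

import numpy as np

def discard(top_field, bottom_field):
    # Throw away one field, line-double the other back to full height.
    return np.repeat(top_field, 2, axis=0)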


Blending

You can use all the data by blending (a.k.a. averaging, field combining) both fields from a frame into a single output frame.

For video A, you would produce an (unnecessarily) blurred version of the original. Weave may be preferable in various cases.

For video B you would create a 25 frame per second version with less jagging, but motion will have a sort of ghosting to it. Better than weave in that the jagging isn't so obvious, but anything that moves will be blurred.


(Note that sizing down video has a similar effect, and can even be used as a quick and dirty way of deinterlacing if software does not offer any deinterlacing options at all)
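A sketch of blending in the same illustrative style (one of several ways to do it: weave, then average each line with its neighbour, so inter-field motion becomes ghosting rather than combing):

import numpy as np

def blend(top_field, bottom_field):
    # Weave the two fields into one full-height frame...
    woven = np.empty((top_field.shape[0] * 2, top_field.shape[1]), dtype=np.float32)
    woven[0::2] = top_field
    woven[1::2] = bottom_field
    # ...then vertically average adjacent lines (which come from opposite fields).
    out = woven.copy()
    out[:-1] = (woven[:-1] + woven[1:]) / 2
    return out.astype(top_field.dtype)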


Bob

Bob, Bobbing (also 'progressive scan', but that is an ambiguous term) refers to taking both fields from a frame, line-doubling each to full height, and displaying them in sequence.

Video A would become line-doubled 50 frames per second. Stationary objects will seem to bob up and down a little, hence the name. You're also doubling the amount of storage/bandwidth necessary for this 25fps video while reducing its vertical detail.

Video B would be shown with its frames at their natural, fluid 50 frames per second (note that most other methods would create 25 frame per second output). (Also note that if you take the wrong field first, you'll put them in the wrong order and the video will look a bit twitchy.)
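A sketch of bob in the same illustrative style (one interlaced frame becomes two line-doubled progressive frames, doubling the frame rate):

import numpy as np

def bob(top_field, bottom_field):
    # Each field becomes its own full-height frame; show them in sequence.
    first  = np.repeat(top_field, 2, axis=0)     # field order matters -
    second = np.repeat(bottom_field, 2, axis=0)  # swapping them looks twitchy
    return first, second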


Bob-and-weave and optional cleverness

Bob and weave refers to the combination of bob and weave: roughly, weaving where the image is static (keeping full vertical detail) and bobbing where there is motion (avoiding combing). Smarter, motion-adaptive deinterlacers make this choice per region.


Telecine

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Telecine (portmanteau of 'television' and 'cinema') refers to the process of converting video between these two worlds -- in a wide sense, so can refer to various distinct methods.


For a long while, the most common were:

  • conversion from film to PAL television (25fps)
often by playing the frames at 25fps and the audio a little faster
  • conversion from film to NTSC television (30fps, adding intermediate frames)
specifically three-two pull down - in some contexts telecine is used as a near-synonym for 3-2 (which is a little misleading, and this page is also guilty of that)


Frame rate conversion from 24fps film to 30000/1001 broadcast NTSC is usually done using three-two pulldown, which uses a fixed pattern in which some frames are used as-is and others are interlace-like, half-updated frames. This variant of a more general technique turns groups of 4 frames into 5, which for 24fps input means an extra 6 frames per second.

Since you end up with 30 frames per second that look approximately the same as the 24fps original, the audio speed can stay the same. It still means leaving some frame content on screen longer than other content, which is visible in some types of scenes. For example, a slow, smooth camera pan would be shown with a slight judder.
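A schematic of the 2:3 pattern (a sketch that ignores field dominance and other details; the letters stand for film frames, and each video frame is a (top field, bottom field) pair):

def pulldown_2_3(a, b, c, d):
    # Four film frames become five video frames; frames 3 and 4 mix fields
    # from two different film frames, which is what inverse telecine undoes.
    return [
        (a, a),  # video frame 1: both fields from A
        (b, b),  # video frame 2: both fields from B
        (b, c),  # video frame 3: top field from B, bottom from C
        (c, d),  # video frame 4: top field from C, bottom from D
        (d, d),  # video frame 5: both fields from D
    ]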

Three-two pulldown is invertible, in that you can calculate the original frames from a stream of telecined frames, though if content is spliced after three-two pulldown is applied, you may lose the ability to reconstruct a frame or two.

Video editing would probably choose to decode back to the original film frames (inverse telecine).

Potentially, so might playback on monitors, because the original frames show a little less judder.

Inverse telecine is also useful when (re)compressing telecined video, since many codecs don't like the interlace-like effects of telecining and may deal with it badly.


Hard telecining refers to storing the telecined content itself, e.g. 30000/1001fps for NTSC generated from 24fps film, so that it can be played (progressively) on NTSC equipment as-is. The upside is that the frame rate is correct and the player doesn't have to do anything fancy; the downside is that it usually has a negative effect on quality.

Soft telecining refers to storing video using the original framerate (e.g. 24000/1001 or 24fps film) and flagging the content so that the player (e.g. set-top DVD players) will telecine it to 30000/1001 on the fly. NTSC DVDs with content originating from 24fps film usually store 24000/1001fps(verify) progressive, with pulldown flags set so that the DVD player will play it as 30000/1001fps.


See also:

Mixed content

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Most container formats aren't picky about the exact contents and will allow mixes of different frame rates, mixes of progressive and telecined, of progressive and interlaced, and sometimes even all three.


When you want to edit video content further, you probably want to figure out whether a video is progressive, interlaced or telecined.

One way to do this is using a player that allows per-frame advancing (such as mplayer). Make sure it's not applying filters to fix interlacing/telecining, find a scene with movement (preferably horizontal movement/panning), and see whether there are interlace-style jaggies.

  • If there are none, it is progressive (or possibly already deinterlaced by the player)
  • If there are in every frame, it is interlaced
  • If there are in only some frames, it is telecined (two out of five in 2:3, 24-to-30fps telecine).

Note that things like credits may be different (apparently often telecined on DVDs).
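As a crude sketch of the kind of per-frame check you could automate (not any particular tool's algorithm; it simply compares adjacent lines, which come from opposite fields, against lines two apart, which come from the same field):

import numpy as np

def combing_score(frame):
    # High values suggest interlace/telecine combing in this frame:
    # progressive frames score near 1, combed frames noticeably higher.
    f = frame.astype(np.float32)
    adjacent   = np.abs(f[1:] - f[:-1]).mean()   # opposite-field neighbours
    same_field = np.abs(f[2:] - f[:-2]).mean()   # same-field neighbours
    return adjacent / (same_field + 1e-6)

Scoring every frame this way roughly reproduces the manual check above: no high scores suggests progressive, high scores everywhere suggests interlaced, and high scores in two out of every five frames suggests 2:3 telecine.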


While telecining uses a regular pattern of extra frames, splicing after telecining means the video will usually not follow that pattern around a splice, meaning that inverse telecine may not be able to reconstruct all original frames. This is often why encoders/players complain about a few skipped and/or duplicate frames in a movie's worth of frames; you can generally ignore this - hardware players deal with the same thing.

See also (interlacing, telecining)

Deinterlacing:

Telecining:

Deinterlacing, telecining:

Various:

(Analog) TV formats

There are a number of variants on NTSC, PAL and SECAM that may make TVs from different countries incompatible. NTSC is used in North America and part of South America (mostly NTSC M), and Japan (NTSC J).

PAL is used in most of Europe, part of South America, part of Africa, and Asia. SECAM is used in a few European countries, part of Africa, and Russia.

PAL M (used in Brazil) is an odd one out, being incompatible with other PAL standards, and instead resembling NTSC M - in fact being compatible in the monochrome part of the NTSC M signal.


NTSC and PAL are largely the same inside (PAL fixed a color-consistency problem that NTSC was stuck with because its standard was already set), and differ mainly in frame rate and number of lines.

CRT TVs often support just one of these, as it would be complex (and imperfect) to convert more than one, and few people would care for this feature as most people had one type of broadcast around.


It should be noted that the broadcast signal contains more lines than are shown on the screen: only part of the video lines are the raster, the imagery that will actually be displayed.

  • 525-scanline video (mostly NTSC) has 486 in the raster, and many show/capture only 480(verify)
  • 625-scanline video (mostly PAL) has 576 in the raster

The non-raster lines historically were the CRT's vertical blanking interval (VBI), but now often contain things like teletext, closed captioning, station identifiers, timecodes, sometimes even things like content ratings and copy protection information (note: not the same as the broadcast flag in digital television).

Video recording/capture will often strip the VBI, so it is unlikely that you will even have to deal with it. Some devices, like the TiVo, will use the information (e.g. respect copy protection) but do not record it (as lines of video, anyway).

Devices exist to add and alter the information here.


PAL ↔ NTSC conversion

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note that various DVD players do this, although others do not, and neither the fact that they do nor the fact that they don't is necessarily advertised very clearly.


PAL to NTSC conversion consists of:

  • Reducing 625 lines to 525 lines
  • creating ~5 more frames per second

NTSC to PAL conversion consists of:

  • increasing 525 to 625 lines
  • removing ~5 frames per second


The simplest method, which cheaper on-the-fly conversion devices often use, is to duplicate or omit lines and frames. This tends not to look great (choppiness, and cropping or stretching of the image).


Linear interpolation (of frames and lines) can offer smoother-looking motion and fewer artifacts, but is more computationally expensive, and has further requirements (such as working on deinterlaced content).

Fancier methods can use things like motion estimation (similar to fancy methods of deinterlacing).
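A sketch of the line-count part using linear interpolation (illustrative only; assumes a progressive, already-deinterlaced frame as a 2D numpy array):

import numpy as np

def resample_lines(frame, out_lines):
    # Map each output line to a fractional source position and blend
    # the two nearest source lines (e.g. 576 -> 480 for PAL -> NTSC).
    in_lines = frame.shape[0]
    pos  = np.linspace(0, in_lines - 1, out_lines)
    lo   = np.floor(pos).astype(int)
    hi   = np.minimum(lo + 1, in_lines - 1)
    frac = (pos - lo)[:, None]
    return ((1 - frac) * frame[lo] + frac * frame[hi]).astype(frame.dtype)

The frame-rate part can be handled the same way in time (blending the two nearest source frames per output frame), which is also where the ghosting of some converters comes from.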

Digital / HD broadcasting

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

ATSC in the US, DVB in Europe


See also:


See also (frames and formats)


Semi-sorted

On types and groups of frames

In MPEG-like codecs (DivX/Xvid, H.264, and more), there is a choice to encode each frame as a/an...

  • I-frame (intra-frame)
    • a frame you can decode purely from its own data
    • larger than predictive (P- and B-) frames when those can predict differences to adjacent frames fairly well (e.g. motion and such), which is usually the case. When predictive frames don't work well, such as at camera cuts, I-frames are preferable.
    • having a guaranteed I-frame every so often helps faster seeking, because seeking often means "look for the most recent I-frame, and start decoding all video until you reach the requested frame"
  • P-frame (predictive)
    • uses information from a previous frame
    • ...which is typically less information than a complete I-frame, so makes for better compression
  • B-frame (bidirectional predictive)
    • uses information both from previous and next frame
    • which ends up being less information than forward-only prediction
    • does better on slow motion(verify)
    • more complex to encode
    • more complex to decode

There also used to be D-frames, in MPEG-1, which were basically low-quality easy-to-decode I-frames, which allowed fast preview while seeking.


A GOP (group of pictures) is a smallish group of adjacent frames that belong together in some sense.

A GOP starts with an I-frame, and may contain P-frames, B-frames, and technically can also contain I-frames (a GOP is not defined by position of I-frames, but by being marked as a new GOP in the stream. That said, multiple-I-Frame GOPs are not common)(verify).

Most encoders use I-frames (so start a new GOP)...

  • when it makes sense for the content, such as on scene cuts where a predictive frame would be lower quality
  • once every so often, to guarantee seeking is always fastish by always having a nearby I-frame to restart decoding at
  • to satisfy other frame-type restrictions. For example, encoders are typically limited to at most 16 B-frames in a row, and at most 16 P-frames in a row (which gets more interesting when you mix I-, P-, and B-frames)


To make things more interesting, the frame storage order, the decode order, and the display order can all be different. This is largely because a B-frame by its nature cannot be decoded until both frames it references (one earlier, one later in display order) have been decoded, so those reference frames are stored and decoded before it [47]
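A small illustration of the reordering (a hypothetical four-frame group; real encoders vary):

# Display order (what the viewer sees):      I1  B2  B3  P4
# Decode/storage order (what's in the file): I1  P4  B2  B3
# P4 has to be decoded before B2 and B3, because they reference it.
display_order = ["I1", "B2", "B3", "P4"]
decode_order  = ["I1", "P4", "B2", "B3"]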


If your video consists entirely of I-frames, you could call that GOP-less, or size-1-GOP. The result is much larger for the same video, but all seeking is fast, which is nice when editing video, and when you want fast frame-by-frame inspection in both directions, for example to study animation.


A closed GOP is one that can be decoded completely without needing another GOP - basically, it ends in a P-frame.

This is contrasted with an open GOP, which ends in a B-frame and therefore needs to look at the next GOP's first frame (an I-frame).

Open GOPs make for slightly more efficient coding, because you're using a little bit more predictive coding, and in most cases you're just playing all of it anyway.

Closed GOPs make for slightly easier decoding, and for slightly faster seeking for some frames.


Standardized formats, particularly those aimed at hardware players, have guidelines that usually restrict GOPs to relatively small sizes, on the order of 5 to 30 frames.

For example, the specs for DVDs apparently say 12 frames max, which is about half a second worth of video, in part to guarantee seeking in reasonable steps. Blu-ray seems to say 24 frames and/or 1 second (though it depends a little(verify))


Allowing large GOPs (say, 300 frames, ~10 seconds) makes for slightly more efficient coding in the same amount of bytes (because you can make a few more frames predictive rather than forcing an I-frame), but it's a diminishing-returns thing.


Keep in mind that various video players seek faster by going to the I-frame nearest the requested frame rather than to the requested frame itself, which is why they won't always jump by the same amount, may not do fine seeking at all, and with very large GOPs may act a little awkwardly.




To inspect what frame types a file has:

  • People mention ffprobe -show_frames; in current ffmpeg versions this works and reports (among much else) a pict_type of I/P/B per frame, though the output is verbose (a sketch of summarizing it follows after this list).


  • libavcodec has a vstats option, which writes a file in the current directory with frame statistics about the input file. For example:
mplayer -vo null -nosound -speed 100.0 -lavdopts vstats input.avi

(Without -speed it seems to play at the video rate, and there's probably a way around that better than a -speed that it probably won't reach)

  • or ffmpeg:
ffmpeg -i input.avi -vf showinfo -f null -
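
As a rough sketch of counting frame types via ffprobe from Python (assuming a reasonably recent ffmpeg install; the filename is just a placeholder):

import subprocess
from collections import Counter

# ffprobe -show_frames prints one block per frame, including a pict_type= line.
out = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0", "-show_frames", "input.avi"],
    capture_output=True, text=True, check=True,
).stdout

counts = Counter(
    line.split("=", 1)[1]
    for line in out.splitlines()
    if line.startswith("pict_type=")
)
print(counts)   # e.g. Counter({'P': ..., 'B': ..., 'I': ...})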



Some notes on aspect ratio

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Display Aspect Ratio (DAR) means "this ought to be shown at this ratio". Example: 16:9. This is information that some files can store to tell the player to do so (or that standards can imply for all files of a type).

DAR allows more arbitrary aspect ratios in the actual pixel dimensions - for example, SVCDs are 480x480 (NTSC) or 480x576 (PAL/SECAM) -- but store content meant for 4:3 or 16:9 display. Which is one way to store more vertical than horizontal detail. SVCD players will rescale this to some resolution that has the proper aspect ratio at play time (usually just fitting it in the largest non-cropping size for the given TV resolution(verify)).


This works because MPEG can store aspect ratio information, so hardware players and most software players listen to and use it. Not all (software) players understand it in MPEG4 files yet, though.

AVI (and any content stored in it, including MPEG4) does not support it -- but the opendml extension that does allow it is now fairly commonly used. Not all players know about opendml, though most that matter do.

When encoding, the best choice depends on what you want to play things on. The most compatible way is rescaling so that square pixels would play correctly. However, this usually means smallish resolution changes, which can look like a mild blur.
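A small worked example of the arithmetic (using the SVCD PAL numbers above; the variable names are just for illustration):

from fractions import Fraction

stored_w, stored_h = 480, 576          # SVCD PAL storage resolution
dar = Fraction(4, 3)                   # display aspect ratio the file asks for

# pixel aspect ratio = DAR / storage aspect ratio
par = dar / Fraction(stored_w, stored_h)    # -> 8/5, i.e. wide pixels

# one way to show it with square pixels: keep the height, rescale the width
display_w = round(stored_h * dar)           # -> 768, i.e. play as 768x576
print(par, display_w)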

Some notes on frame rates

See also #Frame_rate.2C_analog_TV_format.2C_and_related for basic notes on frame rate.


Motion interpolation

Motion interpolation refers to taking video and inventing intermediate frames:

mostly to make video look more fluid,
sometimes to fake motion effects,
sometimes to compensate for motion blur.


HDTVs do this roughly because they are frequently fed lower-fps video than their panel can display, and they might as well try to make it look better (for some value of better).

It does invent new details creatively, and when scrutinized does do weird things, but in full motion video you are relatively unlikely to see those flaws.


That said, it does alter everything fed through the TV, and in particular film makers don't like that you will never see quite what they made. Which is in part about artifacts, but also because, as a side effect, it may look less cinematic - more like video than film.


There are seemingly endless names for it, some of them:

(Panasonic) Intelligent Frame Creation
(LG) TrueMotion (not to be confused with the video codec)
(Samsung) Auto Motion Plus
(Amazon Fire TV) Motion Processing


https://en.wikipedia.org/wiki/Motion_interpolation

Resolution names/references

This stuff is vaguer and/or more confusing than it should be.


Consider 2K.

  • Around cinema it means DCI 2048x1080 (and some cropped variants, 1998x1080, 2048x858)
  • Around PCs it means 2560x1440. Except when it doesn't.


And then consider lists like WSXGA+, WQXGA, WQXGA+, WQSXGA, WUQSXGA, CWSXGA, WQHD. Without looking it up, do you know any of their resolutions? And which one of them doesn't exist?

Without looking it up, what's the difference between QuadHD and Quad FullHD?

If you can answer either of those questions, you're a hardware freak, the proverbial 1%.


And then there's the fact that the same number can mean different things depending on context and industry:

  • Cinema resolutions are defined by DCI.
which is basically named by horizontal resolution fairly precisely
  • On PCs
people seem to have settled on a "agree to disagree" vagueness around some references
Take 2K. It's a horizontal resolution, yet can be and is explained as both
"where the width is approximately 2000" or
"...starts with a 2, i.e. falls into the 2000...3000 pixel range" (except that is not the convention for 4K, 5K, 8K, 10K, or 16K)
It can mean 2560x1440, which it often seems to do when you say "2K monitor"
Or 2048x1080 (some diagrams specifically call this 2K, and call 2560x1440 WQHD and not 2K)
Or 1920x1080 (a.k.a. 'Full HD') because it's close enough. (Or not, with a "it was an existing thing when we wanted 2K to mean a different thing" argument)
Or maybe sometimes 2048x1536
and technically includes Cinema 2K
  • TV style resolutions are typically referred to by their vertical, shortest edge
in part because historically, the interlaced/progressive difference was important, and the larger horizontal resolution could be implied from the vertical, largely because there were only a few
  • I've worked in a field where sensors are usually square but if not we'd often use the shortest edge because it was limiting


TV

References like 480i and 720p became more commonly used in the era commonly known as HD (i.e. now), partly just because they're brief and more precise.

(These references are not often seen alongside monitor resolutions, perhaps because "720p" and "1080p HD" are new names that are easier to market when you don't realize that a good deal of CRT monitors have done such resolutions since the nineties.)


References such as 480i and 720p refer to the vertical pixel size and whether the video is interlaced or progressive.

The common vertical resolutions:

  • 480 (for NTSC compatibility)
    • 480i or 480p
  • 576 (for PAL compatibility)
    • 576i or 576p
  • 720
    • always 720p; 720i does not exist as a standard
    • 1280x720 (sometimes 960x720)
  • 1080 (HD)
    • 1080i or 1080p
    • usually 1920x1080


There are some other newish resolutions, many related to content for laptops/LCDs, monitor/TV hybrids, widescreen variations, and such.


HD TV broadcasts for a long while were typically either 1080i or 720p. While 1080i has greater resolution (1920x1080 versus 1280x720), 720p has no interlace artifacts and may look smoother.


The 480 and 576 variants usually refer to content from/for (analog) TVs, so often refer to more specific formats used in broadcast.

  • 576 often refers to PAL, more specifically:
    • analogue broadcast TV, PAL - specifically 576i, and then often specifically 576i50
    • EDTV PAL is progressive, 576p
  • 480 often refers to NTSC, more specifically:
    • analogue broadcast TV, NTSC - specifically 480i, and then often specifically 480i60
    • EDTV NTSC is progressive, 480p
  • 486 active lines seems to refer to older NTSC - it now usually has 480 active lines

There is more variation with various extensions - widescreen, extra resolution as in e.g. PALPlus, and such.



Sometimes the frame rate is also added, such as 720p50 - which usually refers to the display frequency applicable.

In cases like 480i60 and 576i50 you know this probably refers to content from/for NTSC and PAL TV broadcast, respectively (...though there are countries with other res-and-frequency combinations).

See also:


Analog horizontal pixel resolution is approximate

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

For analog TV, pixel-per-line resolution is not really set in stone. Because of the way the signal is used, anything above ~500 or so looks good enough.

  • Cheaper NTSC CRTs couldn't really display more than ~640 pixels per line; cheaper PAL CRTs had similar limits (verify)
  • 720 was treated as a maximum (fancy editing systems of the time supported it)
  • 704 is a sort of de facto assumption of the average that TVs tend to display(verify), and is also what EDTV uses (704x576 for PAL and 704×480 for NTSC)

As such,

  • NTSC can be 720x480 or 704x480, or 640x480,
  • PAL can be 720x576 or 704x576,

depending a little on context.


On digital broadcast, a stream has a well-defined pixel resolution, but since the displays are more capable, they are typically quite flexible in terms of resolution and frame rate.


Relevant acronyms here include

  • ATSC (digital broadcasting in the US, replacing analog NTSC)
  • DVB (digital broadcasting in Europe, replacing analog PAL)
  • EDTV (sort of halfway between basic digital broadcast and HDTV)
  • HDTV


More resolutions

See e.g. the image on http://en.wikipedia.org/wiki/Display_resolution


Screen and pixel ratios

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


DPI and such

Video capture hardware

Video editing hardware

Webcam/frame-oriented software

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(What you may want to look for in) more-than-webcam software

Mainly for editing

Mainly for conversion

Some specific tools

See also