Binary-to-text coding
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me) |
The basic problem
Using the full range of possible byte values can be problematic when transmitting or storing data. You can't always throw in said data and guarantee it'll arrive / be retrieved verbatim.
Historically, a major issue was that some transmission lines were not 8-bit clean, e.g. a serial port set to transmit 7-bit chunks, so you had to fit 8-bit data into 7-bit units, encoding on one side and decoding on the other, to get it across safely.
These days the typical restrictions are e.g. the characters/values allowed in filenames, URLs, HTTP, MIME, transmission methods that interpret control codes, and more.
Some cases have other implications. For example
- putting data in email meant using printable characters, but also e.g. not being sensitive to mail servers that would insert newlines.
- extra filename rules present on some filesystems, such as capitalization rules
The basic solution
Escaping would be nice, but that requires both knowing that something is escaped, and agreement on how, and what contexts you want to guard against. When you're going to transform the data anyway, you might as well do it in some way that is standard and well known - also known to be safe in the contexts you want to use it.
The basic solution is to recode data into a smaller set of safe values, a reduced character set, and there are a good number of methods to do just that.
These methods vary mostly in coding efficiency (how much larger the data will get as a result), where they can be safely used, and how complex they are to use, and sometimes how convenient they are for specific applications.
Considering most of these limitations points to the use of the ASCII printable characters - values 0x21..0x7E, which is 94 values. You can regularly add space (0x20) to the list, which makes for 95 safe values.
When you really want the lowest common denominator, you mostly just get the basic alphanumeric characters - A-Z, a-z, and 0-9, for 62 values, which are safe in some of the most restrictive contexts.
In case-insensitive contexts it's about 36 (A-Z and 0-9), but there are few examples where that amount of safety is required. Filesystems are arguably one - their APIs and some OS interfaces may be case insensitive.
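The counts above are easy to check. A minimal sketch in Python (the variable names are mine, purely for illustration):

```python
import string

# Printable ASCII without the space: 0x21..0x7E inclusive.
printable = [chr(value) for value in range(0x21, 0x7F)]
# Adding the space (0x20) gives the usual "95 printable characters".
with_space = [" "] + printable
# Lowest common denominator: letters and digits only.
alphanumeric = string.ascii_letters + string.digits
# Where case cannot be relied on, upper and lower case collapse into one set.
case_insensitive = string.ascii_uppercase + string.digits

print(len(printable), len(with_space), len(alphanumeric), len(case_insensitive))
```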
On storage efficiency
Bytes per output character:
- Hexadecimal (Base16) is 200% of the original (puts 0.5 bytes in an output character)
- Base32 is ~160% of the original (~0.625 bytes per output character)
- UUEncoding uses approx. 133-140% of the original size (~0.7 bytes)
- Base64 / PEM is ~133% of the original (0.75 bytes in an output character)
- Ascii85 uses ~125% of the original size
- Base91 uses 123% of the original size
See below
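To see where these percentages come from: an alphabet of N characters can carry at most log2(N) bits per character, so the best possible expansion is 8 / log2(N). The sketch below compares that bound with what Python's standard base64 module actually produces. The input size is chosen (by me) to be a multiple of 3, 4 and 5 bytes so that padding does not skew the numbers; Base91 is only shown as a bound because the standard library has no implementation of it.

```python
import base64
import math

def theoretical_expansion(alphabet_size: int) -> float:
    """Output size relative to input if each character carried log2(alphabet_size) bits."""
    return 8 / math.log2(alphabet_size)

# 3840 bytes: divisible by 3, 4 and 5, so no encoder needs padding.
data = bytes(range(256)) * 15

measured = {
    "Base16": len(base64.b16encode(data)) / len(data),
    "Base32": len(base64.b32encode(data)) / len(data),
    "Base64": len(base64.b64encode(data)) / len(data),
    "Ascii85": len(base64.a85encode(data)) / len(data),
}

for name, ratio in measured.items():
    print(f"{name}: {ratio:.1%} of the original")

for size in (16, 32, 64, 85, 91):
    print(f"alphabet of {size}: at best {theoretical_expansion(size):.1%}")
```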
Examples
PEM encoding
Refers to what the Privacy-enhanced Electronic Mail standard calls 'Printable Encoding' (see RFC 1421).
It also uses "=" symbol as a padding suffix code (since it codes 6 bits at a time, it might not finish up on a byte boundary).
Resembles (and is related to) Base64.
Not URL-safe or filename-safe.
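The padding behaviour is easiest to see on very short inputs. The sketch below uses Python's standard Base64 encoder as a stand-in, on the assumption that it pads the same way as PEM's printable encoding; it is not a PEM implementation.

```python
import base64

# One, two and three input bytes: only the last fills whole 6-bit groups.
one = base64.b64encode(b"a")
two = base64.b64encode(b"ab")
three = base64.b64encode(b"abc")

print(one, two, three)  # b'YQ==' b'YWI=' b'YWJj'
```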
Base64
Treats byte data numerically, recodes it into base 64, then writes it into ASCII text using A-Z, a-z, 0-9, +, and / (plus = as padding).
Resembles PEM. Based on PEM, in fact (verify).
Historically seen a lot in MIME email. Because of that use in email, MIME's Base64 specifies a maximum line length of 76 characters.
Base64 is often an easy generic choice for embedding binary data in various other things, be it MIME, JSON, XML, strings in code, and others.
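A rough sketch of the "recode into base 64" idea, checked against Python's standard library. The function name is hypothetical, and the alphabet is the standard one from RFC 4648 as I understand it; treat this as an illustration rather than a reference implementation.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64_encode_sketch(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Treat up to three bytes as one 24-bit number (zero-filled on the right).
        number = int.from_bytes(chunk.ljust(3, b"\0"), "big")
        # Split it into four 6-bit digits, most significant first.
        digits = [(number >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[d] for d in digits]
        # Digits that carry no input bits are replaced by '=' padding.
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)

print(b64_encode_sketch(b"hello world"))

# MIME-style output: base64.encodebytes() breaks lines at 76 characters.
wrapped = base64.encodebytes(bytes(100))
longest_line = max(len(line) for line in wrapped.splitlines())
print(longest_line)
```

In practice you would just call base64.b64encode(); the hand-written version is only there to show the 3-bytes-to-4-characters grouping.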
...but
- not safe in URLs (because of /, =, and +)
- not safe in filenames (because of the /).
There are simple modifications for both, see below.
See also:
- http://en.wikipedia.org/wiki/Base64
- RFC 4648 ('The Base16, Base32, and Base64 Data Encodings')
Similar systems:
- PEM
- OpenPGP's ASCII armor
- UTF-7 and
Modified Base64 for URLs
Avoids characters that can be finicky in context of URL parsing, so you can dump the result in URLs with little headache (also usable as filenames, though there's a smaller alteration for it dealing just with /).
Differences from basic Base64:
- uses - instead of +
- uses _ instead of /
- uses no = padding
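A minimal sketch of these three differences using Python's standard library. The helper names are hypothetical; urlsafe_b64encode() itself keeps the padding, so it is stripped here and restored before decoding.

```python
import base64

def b64url_encode(data: bytes) -> str:
    # urlsafe_b64encode swaps '+' for '-' and '/' for '_'; padding is removed by hand.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    # Restore the padding the decoder expects (length must be a multiple of 4).
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)

raw = b"\xfb\xff"
print(base64.b64encode(raw))   # b'+/8=' in plain Base64
print(b64url_encode(raw))      # '-_8' in the URL variant
```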
Modified Base64 for filenames
Avoids
Other modified Base64
Base32
Similar to Base64 but codes only 5 bits into ASCII at a time, using a 32-character alphabet (A-Z and 2-7 in RFC 4648).
Still more efficient than hexadecimal.
Not strictly URL-safe (nor universally filename-safe) because of the =.
Insensitive to case changes.
Example:
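A quick illustration with Python's standard base64 module (the sample inputs are my own choice, not from the original notes):

```python
import base64

full_group = base64.b32encode(b"hello")   # 5 bytes fill exactly 8 characters
padded = base64.b32encode(b"hi")          # shorter input is padded with '='
# casefold=True lets the decoder accept lower case, which is what makes
# Base32 survive case-mangling contexts.
decoded = base64.b32decode("nbswy3dp", casefold=True)

print(full_group, padded, decoded)  # b'NBSWY3DP' b'NBUQ====' b'hello'
```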
Hexadecimal (Base16)
That is, each byte value as hex, printed as text, so uses only the hex digits (0-9 and A-F or a-f), two per byte.
Less efficient storage-wise than Base64 or Base32, but one of the easiest options that is safe almost anywhere (URLs, filesystem names, etc.), and you could easily write the encode and decode code in a few lines.
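As a sketch of "a few lines": the two hypothetical helpers below do the work by hand. Python already ships the same thing as bytes.hex() and bytes.fromhex(), which is what you would normally use.

```python
HEX_DIGITS = "0123456789abcdef"

def hex_encode(data: bytes) -> str:
    # High nibble first, then low nibble: two characters per byte.
    return "".join(HEX_DIGITS[b >> 4] + HEX_DIGITS[b & 0x0F] for b in data)

def hex_decode(text: str) -> bytes:
    if len(text) % 2:
        raise ValueError("hex text must have an even number of digits")
    # int(..., 16) accepts both upper and lower case digits.
    return bytes(int(text[i:i + 2], 16) for i in range(0, len(text), 2))

sample = b"\x00\xff\x10"
print(hex_encode(sample))  # 00ff10
```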
Quoted printable
Can represent any octet with an = followed by its hex-as-ascii value, e.g. =0C
The equals sign (=) is always encoded (=3D) in the encoded form.
Any other value may be left un-encoded. Encoding a character that is already safe adds nothing, and safety is the main point of quopri, so the context decides what gets encoded this way.
Usually:
- printable ASCII (0x21..0x7E) other than = is not encoded (because it is usually safe)
- end-of-line characters are usually not encoded
- spaces and tabs are not encoded unless they appear before a newline (where they may not necessarily be safe)
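The rules above can be observed with Python's standard quopri module. This is only a sketch of that one implementation's default behaviour; other encoders may choose to escape more.

```python
import quopri

plain = quopri.encodestring(b"plain text")
equals = quopri.encodestring(b"a=b")
non_ascii = quopri.encodestring("caf\u00e9".encode("utf-8"))

print(plain)      # printable ASCII passes through unchanged
print(equals)     # '=' itself becomes '=3D'
print(non_ascii)  # each non-ASCII byte becomes '=' plus two hex digits

# Decoding reverses the escapes.
round_trip = quopri.decodestring(non_ascii).decode("utf-8")
```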
Quopri is mostly seen in MIME mail, where it can be more readable (and more space-efficient) than Base64, but only when the data is mostly ASCII text.
Not URL-safe.
Not filename-safe unless specific care is taken to encode all unsafe characters (Most filesystems do allow =), though you may run into the filename-length-limit much faster so there are often better options.
Size compared to the original varies with content. Worst case is 300% of the original (every single character encoded), while cases where nothing needs to be escaped are close to 100%.
So it's mostly useful as a guarantee of safety for text, not as an efficient means of binary data transfer.
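Both ends of that range are easy to reproduce. A small sketch with Python's quopri module; the helper name is made up, and the inputs are kept short so that soft line breaks do not enter the picture.

```python
import quopri

def qp_ratio(data: bytes) -> float:
    """Encoded size divided by original size."""
    return len(quopri.encodestring(data)) / len(data)

text = b"mostly plain ASCII text"
binary = bytes([0xFF]) * 20   # every byte needs the three-character =FF form

print(qp_ratio(text), qp_ratio(binary))
```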
UUEncoding
Was mainly used for UUCP.
Mail is now usually MIME (so Base64), or occasionally yEnc.
Uses approx. 133-140% of the original size.
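Python's binascii module can still produce single uuencoded lines, which is enough to see the overhead. A sketch, assuming the classic layout of one length character, the 6-bit-coded data, and a newline per line of at most 45 input bytes:

```python
import binascii

short_line = binascii.b2a_uu(b"Cat")
print(short_line)  # b'#0V%T\n': '#' is the length (3), then four data characters

# A full line carries 45 input bytes in 60 data characters,
# plus the length character and the newline.
full_line = binascii.b2a_uu(bytes(range(1, 46)))
overhead = len(full_line) / 45
print(len(full_line), f"{overhead:.1%}")

restored = binascii.a2b_uu(short_line)
```

The data characters alone account for 133%; the per-line length character and newline push that to roughly 138%.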
BinHex
The newer variant is mostly 8-to-6-bit coding, so uses ~133% of the original.
Ascii85
Sort of like a Base-85 that manages to code four bytes in five printable ASCII characters, so a little more efficient than most of the above, at 125% of the original size.
Used in PostScript and PDF.
Has a few mild variations, depending on where it is used (e.g. basic, btoa, and Adobe's).
Extensions allow use of z for a four-byte group of zero values, and y for a group of four spaces.
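Python's base64.a85encode() exposes these variations as options, which makes for a compact sketch (the sample inputs are mine):

```python
import base64

group = base64.a85encode(b"abcd")                       # four bytes become five characters
zeros = base64.a85encode(bytes(4))                      # all-zero group folds to 'z'
spaces = base64.a85encode(b"    ", foldspaces=True)     # btoa-style 'y' for four spaces
framed = base64.a85encode(b"abcd", adobe=True)          # Adobe variant adds <~ and ~>

print(group, zeros, spaces, framed)

# The decoder needs to be told about the same options.
restored = base64.a85decode(framed, adobe=True)
```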
Not URL-safe or filename-safe.
Z85
Variant of Ascii85, basically designed to be a little safer within strings in JSON, XML, and most code.
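A minimal encoder sketch. The alphabet and the "HelloWorld" test vector are taken from my reading of the ZeroMQ Z85 specification, not from these notes, and the function name is hypothetical; it only handles input whose length is a multiple of four bytes.

```python
# Assumed Z85 alphabet: no quotes and no backslash, which is what makes it
# comfortable inside string literals.
Z85_ALPHABET = (
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".-:+=^!/*?&<>()[]{}@%$#"
)

def z85_encode(data: bytes) -> str:
    if len(data) % 4:
        raise ValueError("this sketch only handles lengths that are a multiple of 4")
    out = []
    for i in range(0, len(data), 4):
        # Four bytes as one big-endian 32-bit number...
        value = int.from_bytes(data[i:i + 4], "big")
        digits = []
        # ...written as five base-85 digits, least significant first.
        for _ in range(5):
            value, digit = divmod(value, 85)
            digits.append(Z85_ALPHABET[digit])
        out.append("".join(reversed(digits)))
    return "".join(out)

print(z85_encode(bytes([0x86, 0x4F, 0xD2, 0x6F, 0xB5, 0x59, 0xF7, 0x5B])))  # HelloWorld
```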
...but not e.g. filenames.
Base91
Pushing the Base85 idea a little further, for 123% of the original size.
MECE (Multiple Escape Character Encoding)
Capable of Unicode.
URL-safe and filename-safe.
Not very byte-efficient, though.
yEnc
Used on usenet for files.
It seems this cares only about being MIME-safe and assumes 8-bit clean transfer (which has usually been true for quite a while now).
Uses non-ASCII values, but still a reduced character set (relies on 'Extended ASCII' codepage, so cannot be mixed with other encodings in the same message).
Noticeably more efficient than MIME's Base64, uuencode, and binhex; in many cases the encoded form can be ~102% of the original.
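The low overhead follows from how little gets escaped. The sketch below is my own simplified reading of the yEnc body coding (offset every byte by 42, escape only NUL, LF, CR and = with a leading =), and it deliberately ignores line breaks, headers and checksums, so it is not a usable yEnc implementation. Function names are hypothetical.

```python
# Assumption: these four values are the only ones that must be escaped.
CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}  # NUL, LF, CR, '='

def yenc_body_sketch(data: bytes) -> bytes:
    out = bytearray()
    for byte in data:
        value = (byte + 42) % 256
        if value in CRITICAL:
            # Escape: '=' followed by the value shifted by another 64.
            out.append(0x3D)
            value = (value + 64) % 256
        out.append(value)
    return bytes(out)

def yenc_body_decode_sketch(encoded: bytes) -> bytes:
    out = bytearray()
    values = iter(encoded)
    for value in values:
        if value == 0x3D:
            value = (next(values) - 64) % 256
        out.append((value - 42) % 256)
    return bytes(out)

data = bytes(range(256)) * 4           # every byte value equally often
encoded = yenc_body_sketch(data)
print(f"{len(encoded) / len(data):.1%}")
```

With uniformly distributed input, only 4 in 256 bytes double in size, which is about 1.6% overhead before line breaks are added.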
Not a formal standard, and has some flaws that make wide adoption problematic (and even the given application has a few problems).
Kermit
The Kermit protocol includes a way to send 8-bit data over 7-bit lines.