Binary-to-text coding

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


The basic problem

Using the full range of possible byte values can be problematic when transmitting or storing data. You can't always throw in said data and guarantee it'll arrive / be retrieved verbatim.


Historically, a major issue was that some transmission lines were not 8-bit clean, e.g. a serial port set to transmit 7-bit chunks, so 8-bit data had to be encoded into 7-bit units, and decoded on the other end, to get it across safely.

These days the typical restrictions are e.g. the characters/values allowed in filenames, URLs, HTTP, MIME, transmission methods that interpret control codes, and more.


Some cases have other implications. For example

  • putting data in email meant using printable characters, but also e.g. not being sensitive to mail servers that would insert newlines.
  • extra filename rules present on some filesystems, such as capitalization rules


The basic solution

Escaping would be nice, but that requires both knowing that something is escaped, and agreement on how, and what contexts you want to guard against. When you're going to transform the data anyway, you might as well do it in some way that is standard and well known - also known to be safe in the contexts you want to use it.


The basic solution is to recode data into a smaller set of safe values, a reduced character set, and there are a good number of methods to do just that.

These methods vary mostly in coding efficiency (how much larger the data will get as a result), where they can be safely used, and how complex they are to use, and sometimes how convenient they are for specific applications.


Considering most of these limitations points to use of the ASCII printable characters - values 0x21..0x7E, which is 94 values. You can regularly add space (0x20) to the list, which makes for 95 safe values.

When you really want the lowest common denominator, you mostly just get the basic alphanumeric characters - A-Z, a-z, and 0-9, for 62 values, which are safe in some of the most restrictive contexts.

In case-insensitive contexts it's about 36 (one case of letters, plus digits), but there are few examples where that amount of safety is required. Filesystems are arguably one - their APIs and some OS interfaces may be case insensitive.


On storage efficiency

Bytes per output character:

  • Hexadecimal (Base16) is 200% of the original (puts 0.5 bytes in an output character)
  • Base32 is ~160% of the original (~0.625 bytes per output character)
  • UUEncoding uses approx. 133-140% of the original size (~0.7 bytes)
  • Base64 / PEM is ~133% of the original (0.75 bytes per output character)
  • Ascii85 uses ~125% of the original size
  • Base91 uses ~123% of the original size
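The percentages above follow from how many bits one output character can carry, which is log2 of the alphabet size. A minimal sketch of that arithmetic, where the function name expansion is just an illustrative choice, and which gives only the theoretical figure (real formats add padding, line breaks, and framing on top):

```python
import math

def expansion(alphabet_size: int) -> float:
    """Ideal encoded size relative to the original, for a given alphabet size."""
    bits_per_char = math.log2(alphabet_size)  # information carried per output character
    return 8 / bits_per_char                  # output characters needed per input byte

for name, size in [("Base16", 16), ("Base32", 32), ("Base64", 64),
                   ("Base85", 85), ("Base91", 91)]:
    print(f"{name}: {expansion(size):.0%} of the original")
```

The jump from 64 to 85 or 91 characters buys relatively little, which is part of why Base64 remains the common default.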

See the individual sections below for details.

Examples

PEM encoding

Refers to what the Privacy-enhanced Electronic Mail standard calls 'Printable Encoding' (see RFC 1421).


Codes data into a 64-character alphabet of only printable characters (ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/). It also uses the = symbol as a padding suffix (since it codes 6 bits at a time, the data might not finish on a byte boundary).


The earlier spec (in RFC 989) also used * (as a marker of unencrypted data).

Resembles (and is related to) Base64.

Not URL-safe or filename-safe.



Base64

Treats byte data numerically, recodes it into base 64, then writes it into ASCII text using ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

As it codes 6 bits per output character, every 3 input bytes become 4 output characters. When the input length is not a multiple of 3 bytes, the last group does not end on a byte boundary, and one or two = characters are appended as padding to signal that.

Example: Hello is SGVsbG8=
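The example can be reproduced with Python's standard base64 module. This is only a quick sketch; the loop shows how the amount of = padding depends on the input length modulo 3:

```python
import base64

encoded = base64.b64encode(b"Hello")
print(encoded)                    # b'SGVsbG8='
print(base64.b64decode(encoded))  # b'Hello'

# 3 bytes -> no padding, 4 bytes -> '==', 5 bytes -> '='
for data in (b"Hel", b"Hell", b"Hello"):
    print(len(data), base64.b64encode(data))
```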


Resembles PEM. Based on PEM, in fact (verify).

Historically seen a lot in MIME email. Because of that use, the MIME variant (RFC 2045) specifies a maximum line length of 76 characters; RFC 4648 itself does not require line breaks.


Base64 is often an easy generic choice for embedding binary data in various other things, be it MIME, JSON, XML, strings in code, and others.

...but

  • not safe in URLs (because of /, =, and +)
  • not safe in filenames (because of the /)

There are simple modifications for both, see below.


See also:

  • http://en.wikipedia.org/wiki/Base64
  • RFC 4648 ('The Base16, Base32, and Base64 Data Encodings')
    • which obsoletes the older RFC 3548 ('The Base16, Base32, and Base64 Data Encodings'), which was a unification of the definitions in RFC 1421 (PEM) and RFC 2045 (MIME)

Similar systems:

  • PEM
  • OpenPGP's ASCII armor
  • UTF-7 and


Modified Base64 for URLs

Avoids characters that can be finicky in context of URL parsing, so you can dump the result in URLs with little headache (also usable as filenames, though there's a smaller alteration for it dealing just with /).


Differences from basic Base64:

  • uses - instead of +
  • uses _ instead of /
  • usually uses no = padding (RFC 4648 allows omitting it when the data length is known)

...so uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
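A minimal sketch of this variant in Python, using the standard library's urlsafe_b64encode (which swaps in - and _ but keeps the padding) and stripping the = by hand. The helper names b64url_encode and b64url_decode are made up for this example:

```python
import base64

def b64url_encode(data: bytes) -> str:
    # urlsafe_b64encode uses '-' and '_' but still pads, so strip the '='.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    # Restore the padding the decoder expects (length must be a multiple of 4).
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)

print(base64.b64encode(b"\xfb\xff"))  # b'+/8='  (standard alphabet)
print(b64url_encode(b"\xfb\xff"))     # -_8
```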


Modified Base64 for filenames

Avoids /, uses - instead.

...so uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+-
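In Python this alphabet can be had through the altchars argument of the standard base64 functions, which names the two characters to use in place of + and /. A small sketch, noting that the = padding is left in place here:

```python
import base64

data = b"\xfb\xff"

# altchars=b"+-" keeps '+' as it is and replaces '/' with '-'.
name = base64.b64encode(data, altchars=b"+-")
print(name)                                    # b'+-8='
print(base64.b64decode(name, altchars=b"+-"))  # b'\xfb\xff'
```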


Other modified Base64

Base32

Similar to Base64 but codes only 5 bits into ASCII at a time, using ABCDEFGHIJKLMNOPQRSTUVWXYZ234567, and = for padding.


Still more efficient than hexadecimal.

Not strictly URL-safe (nor universally filename-safe) because of the =.

Insensitive to case changes.

Example: Hello is JBSWY3DP
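The example, and the case insensitivity, can be checked with Python's standard base64 module (a quick sketch; casefold=True tells the decoder to accept lowercase input):

```python
import base64

encoded = base64.b32encode(b"Hello")
print(encoded)  # b'JBSWY3DP'  (5 bytes fit exactly into 8 characters, so no padding)

# A case change in transit does not harm the data.
print(base64.b32decode(b"jbswy3dp", casefold=True))  # b'Hello'
```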



Hexadecimal (Base16)

That is, each byte value as two hex digits, printed as text, so uses 0123456789abcdef (this system is usually case insensitive)

For example, Hello is 48656c6c6f.

Less efficient storage-wise than Base64 or Base32, but one of the easiest options that is safe almost anywhere (URLs, filesystem names, etc.), and you could easily write the encode and decode code in a few lines.
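To illustrate the "few lines" claim, here is a hand-written sketch in Python. The names HEX_DIGITS, hex_encode, and hex_decode are illustrative only; in practice the built-in bytes.hex() and bytes.fromhex() do the same job:

```python
HEX_DIGITS = "0123456789abcdef"

def hex_encode(data: bytes) -> str:
    # High nibble first, then low nibble, for every byte.
    return "".join(HEX_DIGITS[b >> 4] + HEX_DIGITS[b & 0x0F] for b in data)

def hex_decode(text: str) -> bytes:
    text = text.lower()  # accept either case
    if len(text) % 2:
        raise ValueError("odd number of hex digits")
    return bytes(HEX_DIGITS.index(text[i]) * 16 + HEX_DIGITS.index(text[i + 1])
                 for i in range(0, len(text), 2))

print(hex_encode(b"Hello"))      # 48656c6c6f
print(hex_decode("48656C6C6F"))  # b'Hello'
```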

See also:

  • RFC 3548

Quoted printable



Can represent any octet with an = followed by its hex-as-ascii value, e.g. =0C


The equals sign (=) is always encoded (=3D) in the encoded form.

Any other value may also be encoded this way, but for most characters in most contexts that adds no safety, and safety is usually the main point of quopri. Context decides what to encode this way.


Usually:

  • printable ASCII (0x21..0x7E), other than =, is not encoded (because these characters are usually safe)
  • end-of-line characters are usually not encoded
  • spaces and tabs are not encoded unless they appear before a newline (where they may not necessarily be safe)
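These rules can be seen at work with Python's standard quopri module. A small sketch, where the sample bytes are just an illustration (Latin-1 text with one non-ASCII byte and an equals sign):

```python
import quopri

raw = b"caf\xe9 = 1"  # 0xE9 is outside printable ASCII, '=' must always be escaped

encoded = quopri.encodestring(raw)
print(encoded)  # the 0xE9 byte becomes =E9, the equals sign becomes =3D

decoded = quopri.decodestring(encoded)
print(decoded == raw)
```

The plain letters, digits, and inner spaces pass through untouched, which is why mostly-ASCII text stays readable.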


Quopri is mostly seen in MIME mail, where it can be more readable (and more space-efficient) than Base64, but only when the data is mostly ASCII text.


Not URL-safe.

Not filename-safe unless specific care is taken to encode all unsafe characters (Most filesystems do allow =), though you may run into the filename-length-limit much faster so there are often better options.


Size compared to the original varies with content. The worst case is about 300% of the original (every single byte encoded), while data in which nothing needs to be escaped stays close to 100%. So it is useful as a guarantee of safety for text, not as a general means of binary data transfer.



UUEncoding


Was mainly used for UUCP.

Mail is now usually MIME (so Base64), or occasionally yEnc.

Uses  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ (verify)
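That character range is what you get by taking 6 bits at a time and adding 32 to each value. A sketch of that mapping in Python, checked against the standard binascii.b2a_uu (which also prefixes each line with a length character and appends a newline). The helper name uu_chars is made up for this example:

```python
import binascii

def uu_chars(three_bytes: bytes) -> str:
    """Map exactly 3 bytes to 4 characters: each 6-bit value plus 32."""
    n = int.from_bytes(three_bytes, "big")
    sextets = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(chr(32 + s) for s in sextets)

print(uu_chars(b"Cat"))         # 0V%T
print(binascii.b2a_uu(b"Cat"))  # b'#0V%T\n'  ('#' is 32 + 3, the line's byte count)
```

A 6-bit value of zero maps to a space in this basic form; some implementations write a backtick instead, since trailing spaces tend to get lost.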


BinHex


The newer variant is mostly 8-to-6-bit coding, so uses ~133% of the original.



Ascii85


Sort of like a Base-85 that manages to code four bytes in five printable ASCII characters, so a little more efficient than most of the above.

Used in PostScript and PDF.


Has a few mild variations, depending on where it is used (e.g. basic, btoa, and Adobe's).

Based on Base-85, coding these values using:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu. Extensions allow use of z for a group of four zero bytes, and y for a group of four spaces.
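Python's standard base64 module has a85encode and a85decode, which can be used to see the four-bytes-to-five-characters packing and the z and y shortcuts. A brief sketch; the sample data is arbitrary:

```python
import base64

# Four bytes become five characters.
print(len(base64.a85encode(b"data")))  # 5

# One all-zero group, one all-space group, one ordinary group.
sample = b"\x00\x00\x00\x00" + b"    " + b"data"
plain = base64.a85encode(sample)                    # zero group is written as 'z'
folded = base64.a85encode(sample, foldspaces=True)  # space group is also written as 'y'
print(len(plain), len(folded))                      # 11 7

# The decoder has to be told that 'y' is in use.
print(base64.a85decode(folded, foldspaces=True) == sample)

# Adobe's variant frames the data in <~ and ~>.
print(base64.a85encode(b"data", adobe=True))
```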

Not URL-safe or filename-safe.


Z85

Variant of Ascii85, basically designed to be a little safer within strings in JSON, XML, and source code in most languages.

...but not e.g. filenames.


Uses 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#
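A hand-written sketch of the coding with this alphabet, in Python. It assumes the usual Z85 rules: input length must be a multiple of 4 bytes, each group is read as a big-endian 32-bit number and written as five base-85 digits, and there is no padding and no z/y shortcut. The names Z85_CHARS, z85_encode, and z85_decode are illustrative:

```python
Z85_CHARS = ("0123456789abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")

def z85_encode(data: bytes) -> str:
    if len(data) % 4:
        raise ValueError("input length must be a multiple of 4 bytes")
    out = []
    for i in range(0, len(data), 4):
        value = int.from_bytes(data[i:i + 4], "big")
        digits = []
        for _ in range(5):  # five base-85 digits, least significant first
            value, digit = divmod(value, 85)
            digits.append(Z85_CHARS[digit])
        out.append("".join(reversed(digits)))
    return "".join(out)

def z85_decode(text: str) -> bytes:
    if len(text) % 5:
        raise ValueError("input length must be a multiple of 5 characters")
    out = bytearray()
    for i in range(0, len(text), 5):
        value = 0
        for ch in text[i:i + 5]:
            value = value * 85 + Z85_CHARS.index(ch)
        out += value.to_bytes(4, "big")  # raises OverflowError on out-of-range groups
    return bytes(out)

print(z85_encode(bytes([0x86, 0x4F, 0xD2, 0x6F, 0xB5, 0x59, 0xF7, 0x5B])))  # HelloWorld
```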


Base91

Pushing the Base85 idea a little further.

Uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!#$%&()*+,./:;<=>?@[]^_`{|}~" (and notably no -, ', or \)


MECE (Multiple Escape Character Encoding)


Multiple escape characters in BASE 52, using 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP for coding and VWXYZ and QRSTU for escapes.

Capable of Unicode.

URL-safe and filename-safe.

Not very byte-efficient, though.

yEnc


Used on usenet for files.


It seems this cares only about being MIME-safe and otherwise assumes 8-bit clean transfer (which has been usually true for quite a while now).


Uses non-ASCII values, but still a reduced character set (relies on 'Extended ASCII' codepage, so cannot be mixed with other encodings in the same message).


Noticeably more efficient than MIME's Base64, uuencode, and binhex; in many cases the encoded form can be ~102% of the original.

Not a formal standard, and has some flaws that make wide adoption problematic (and even the given application has a few problems).



Kermit


The Kermit protocol includes a way to send 8-bit data over 7-bit lines.
