Binary-to-text coding

From Helpful
Revision as of 17:51, 22 May 2020 by Helpful (Talk | contribs)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Basic problem

Using the full range of possible byte values can be problematic when transmitting or storing data. You can't always throw in said data and guarantee it'll arrive / be retrieved verbatim.


Historically, a major reason was that some transmission lines were not 8-bit clean, e.g. a serial port set to transmit 7-bit chunks, so 8-bit data had to be encoded into 7-bit values and decoded at the other end to get it across safely.

These days the typical restrictions are e.g. the characters/values allowed in filenames, URLs, HTTP, MIME, transmission methods that interpret control codes, and more.


Some cases have other implications. For example

  • putting data in email meant using printable characters, but also e.g. not being sensitive to mail servers that would insert newlines.
  • extra filename rules present on some filesystems, such as capitalization rules


Basic solution

Escaping would be nice, but that requires both knowing that something is escaped, and agreement on how, and what contexts you want to guard against. When you're going to transform the data anyway, you might as well do it in some way that is standard and well known - also known to be safe in the contexts you want to use it.


The basic solution is to recode data into a smaller set of safe values, a reduced character set, and there are a good number of methods to do just that.

These methods vary mostly in coding efficiency (how much larger the data will get as a result), where they can be safely used, and how complex they are to use, and sometimes how convenient they are for specific applications.


Taken together, most of these limitations point to the ASCII printable characters, values 0x21..0x7E (94 values). You can regularly add space (0x20) to the list, which makes for 95 safe values.

When you really want the lowest common denominator, you mostly just get the basic alphanumeric characters, A-Z, a-z, and 0-9, for 62 values, which are safe in some of the most restrictive contexts.

In case-insensitive contexts it's 36 (one case of letters, plus digits), but there are few examples where that amount of safety is required. Arguably filenames are one: some filesystems, their APIs, and some OS interfaces are case insensitive.


On storage efficiency

Encoded size relative to the original (and bytes carried per output character):

  • Hexadecimal (Base16) is 200% of the original (0.5 bytes per output character)
  • Base32 is 160% of the original (0.625 bytes per output character), plus padding
  • UUEncoding uses approx. 133-140% of the original size (~0.7 bytes per output character)
  • Base64 / PEM is ~133% of the original (0.75 bytes per output character), plus padding and any line breaks
  • Ascii85 uses ~125% of the original size (0.8 bytes per output character)
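
The figures above can be checked with Python's standard base64 module. This is only a rough sketch: the payload is an arbitrary choice (3000 bytes, no zero runs) so that padding and Ascii85's zero shortcut do not skew the ratios, and the helper name expansion is made up for this example.

```python
import base64

# Arbitrary payload: 3000 bytes, divisible by 3, 4 and 5, containing no zero bytes.
data = bytes(range(1, 251)) * 12

encoders = {
    "Base16": base64.b16encode,
    "Base32": base64.b32encode,
    "Base64": base64.b64encode,
    "Ascii85": base64.a85encode,
}

def expansion(encode, payload):
    """Encoded length divided by original length."""
    return len(encode(payload)) / len(payload)

for name, encode in encoders.items():
    print(f"{name}: {expansion(encode, data):.0%}")
```

Real-world output is often slightly larger than this, because of padding on short inputs and line breaks in mail-oriented variants.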

PEM encoding

Refers to what the Privacy-enhanced Electronic Mail standard calls 'Printable Encoding' (see RFC 1421).


Codes data into a 64-character alphabet of only printable characters (ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/).

It also uses the "=" symbol as a padding suffix (since it codes 6 bits at a time, the output might not finish on a byte boundary).


The earlier spec (in RFC 989) also used * (as a marker of unencrypted data).

Resembles Base64.

Not URL-safe or filename-safe.


Base64

Treats binary data numerically, recodes it into base 64, then writes it into ASCII text using ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/


Because of its historical use in email, the MIME flavour of Base64 specifies a maximum line length of 76 characters (other uses often emit no line breaks at all).

Since it codes 6 bits per character, three input bytes become four output characters, and the input may not end on a three-byte boundary. In that case the final group is padded with = (one = when two bytes were left over, == when one byte was left over).
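
A small illustration of the padding and line-length behaviour, using Python's standard library. base64.encodebytes is used here as a stand-in for the MIME style of wrapping; the 100-byte zero payload is arbitrary.

```python
import base64

# One, two and three input bytes: two, one and zero padding characters.
for chunk in (b"H", b"He", b"Hel"):
    print(chunk, base64.b64encode(chunk))

# encodebytes() inserts a newline after every 76 output characters (57 input bytes).
wrapped = base64.encodebytes(bytes(100))
lines = wrapped.splitlines()
print([len(line) for line in lines])
```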


Base64 is often one of the easiest choices for embedding binary data in various other things, such as MIME, but also convenient for XML, JSON, strings in code, and others.

Resembles PEM. Based on PEM, in fact.


Example: Hello is SGVsbG8=
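
To make the "treats data numerically" part concrete, here is a minimal hand-written encoder, checked against the standard library. It is a sketch for illustration (the name b64_encode is made up, and it does no line wrapping); in practice you would just call base64.b64encode.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        missing = 3 - len(group)
        # Pack up to three bytes into one 24-bit number, zero-filling a short final group.
        n = int.from_bytes(group + b"\x00" * missing, "big")
        # Read that number as four base-64 digits, most significant first.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Digits that only cover fill bytes are replaced by padding.
        if missing:
            chars[-missing:] = "=" * missing
        out.extend(chars)
    return "".join(out)

print(b64_encode(b"Hello"))
```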


Not URL-safe or filename-safe, but there are simple modifications that are, see below.

Similar systems:

  • PEM
  • OpenPGP's ASCII armor
  • UTF-7 and


Modified Base64 for URLs

Avoids characters that can be finicky in context of URL parsing (=, +, and /), so you can dump the result in URLs with little headache (also usable as filenames, though there's a smaller alteration for it dealing just with /).

Differences from basic Base64:

  • uses no padding (= in basic Base64)
  • uses - and _ instead of + and /

...so uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
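
In Python, the standard library's urlsafe_b64encode does the character swap but keeps the padding, so a sketch of the unpadded form needs to strip it and restore it on decoding. The function names b64url_encode and b64url_decode are made up for this example.

```python
import base64

def b64url_encode(data: bytes) -> str:
    # urlsafe_b64encode swaps +/ for -_ but still pads, so drop the '='.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    # Restore the padding the decoder expects (length must be a multiple of 4).
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)

print(base64.b64encode(b"\xfb\xff"))   # basic Base64 for comparison
print(b64url_encode(b"\xfb\xff"))
```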

Modified Base64 for filenames

Avoids /, uses - instead.

...so uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+-
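
Python's base64 functions take an altchars argument that replaces the last two alphabet characters, which is enough to sketch this variant. The function names are made up for this example, and note that the = padding is left in place here.

```python
import base64

def b64_filename_encode(data: bytes) -> str:
    # altchars gives the replacements for '+' and '/': keep '+', turn '/' into '-'.
    return base64.b64encode(data, altchars=b"+-").decode("ascii")

def b64_filename_decode(text: str) -> bytes:
    return base64.b64decode(text, altchars=b"+-")

print(b64_filename_encode(b"\xfb\xff"))
```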


Base32

Similar to Base64 but codes only 5 bits into ASCII at a time, using ABCDEFGHIJKLMNOPQRSTUVWXYZ234567, and = for padding.


Still more efficient than hexadecimal.

Not strictly URL-safe (nor universally filename-safe) because of the =.

Insensitive to case changes.

Example: Hello is JBSWY3DP
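
A quick check of the example with Python's standard library, which also shows the case insensitivity (via the casefold flag) and the padding that appears when the input is not a multiple of five bytes:

```python
import base64

encoded = base64.b32encode(b"Hello")
print(encoded)

# Lowercased text still decodes when casefold=True is passed.
decoded = base64.b32decode(encoded.lower(), casefold=True)
print(decoded)

# Two bytes fill only four of the eight characters in a group; the rest is padding.
padded = base64.b32encode(b"Hi")
print(padded)
```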


Hexadecimal (Base16)

That is, each byte value as hex, printed as text, so uses 0123456789abcdef (note that this system is usually case insensitive). For example, Hello is 48656c6c6f.

Less efficient storage-wise than Base64, Base32, but one of the easiest options that is safe almost anywhere (URLs, filesystem names, etc.), and you could easily write the encode and decode code in a few lines.
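
As an illustration of "a few lines", a hand-written encoder and decoder might look like the sketch below (the names hex_encode and hex_decode are made up; Python already provides bytes.hex() and bytes.fromhex() for real use).

```python
HEX_DIGITS = "0123456789abcdef"

def hex_encode(data: bytes) -> str:
    # High nibble first, then low nibble, for every byte.
    return "".join(HEX_DIGITS[b >> 4] + HEX_DIGITS[b & 0x0F] for b in data)

def hex_decode(text: str) -> bytes:
    text = text.lower()  # accept either case
    if len(text) % 2:
        raise ValueError("hex text must have an even number of digits")
    return bytes(
        HEX_DIGITS.index(text[i]) * 16 + HEX_DIGITS.index(text[i + 1])
        for i in range(0, len(text), 2)
    )

print(hex_encode(b"Hello"))
```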

Quoted printable


Can represent any octet with an = followed by its hex-as-ascii value, e.g. =0C

All byte values can and may be coded this way. The equals sign (=) itself is always encoded (as =3D). Any other value can be used un-encoded, but that provides no safety, and safety is usually the main point of quoted-printable. Context decides which values to encode (and occasionally when).


Usually:

  • printable ASCII characters (0x21..0x7E, other than =) are not encoded (because they're usually safe)
  • end-of-line characters are usually not encoded
  • spaces and tabs are not encoded unless they appear before a newline (where they may not necessarily be safe)


In MIME mail, this can be more readable (and space-efficient) than Base64, but only when the data is mostly ASCII text.


Not URL-safe. Not filename-safe unless specific care is taken to encode all unsafe characters (Most filesystems do allow =), though you may run into the filename-length-limit much faster.

Size compared to the original varies with content. Worst case is 300% of the original (every single byte encoded), while data where nothing needs to be escaped stays close to 100%. So it's useful as a guarantee of safety, not as an efficient means of binary data transfer.
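
Python ships a quopri module that can illustrate both ends of that range. A small sketch, with arbitrary sample inputs: mostly-ASCII text stays readable, while a run of high bytes triples in size.

```python
import quopri

# Mostly-ASCII text: only the non-ASCII bytes and the '=' get encoded.
text = "café = ok".encode("utf-8")
encoded = quopri.encodestring(text)
print(encoded)

# Worst case: every byte needs the three-character =XX form.
worst = b"\xff" * 20
worst_encoded = quopri.encodestring(worst)
print(len(worst), len(worst_encoded))
```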


UUEncoding

Was mainly used for UUCP. Not used much anymore, because mail is now usually MIME (so Base64), or occasionally yEnc.

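Uses the 64 characters from space (0x20) through _ (0x5F), each being a 6-bit value plus 32:
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
Many implementations write ` (0x60) instead of the space, since spaces are easily lost in transit.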
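
Python's binascii module can encode a single uuencoded line, which is enough to look at the character set in action. A small sketch: the first output character is the line's byte count (also offset by 32), and the backtick keyword (Python 3.7 and later) switches zero values from space to `.

```python
import binascii

line = binascii.b2a_uu(b"Cat")
print(line)                       # length character '#' (3 + 32), then four data characters
print(binascii.a2b_uu(line))

# Zero-valued 6-bit groups: space by default, backtick on request.
zeros_space = binascii.b2a_uu(b"\x00\x00\x00")
zeros_tick = binascii.b2a_uu(b"\x00\x00\x00", backtick=True)
print(zeros_space, zeros_tick)
```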

BinHex

The newer variant is mostly 8-to-6-bit coding, so uses ~133% of the original.


Ascii85

Sort of like a Base-85. Codes four bytes in five ASCII characters, so a little more efficient than most at 125% of the original size.

Used in PostScript and PDF.


Has a few mild variations, depending on where it is used (e.g. basic, btoa, and Adobe's).

Based on Base-85, coding the digit values 0..84 using:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu

Extensions allow use of z for a group of four zero bytes, and y for a group of four spaces.
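
The four-bytes-to-five-characters step can be sketched by hand and compared with Python's base64.a85encode. The helper a85_group is made up for this example and handles only a full four-byte group, without the z and y shortcuts.

```python
import base64

FIRST = 33  # '!' is the character for digit value 0

def a85_group(four: bytes) -> str:
    # Read four bytes as one 32-bit number, then write it as five base-85 digits.
    n = int.from_bytes(four, "big")
    digits = []
    for _ in range(5):
        n, d = divmod(n, 85)
        digits.append(chr(FIRST + d))
    return "".join(reversed(digits))

print(a85_group(b"Man "))
print(base64.a85encode(b"Man "))

# The shortcuts: 'z' for four zero bytes, 'y' for four spaces (when enabled).
print(base64.a85encode(bytes(4)))
print(base64.a85encode(b"    ", foldspaces=True))
```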

Not URL-safe or filename-safe.

Z85

Variant of Ascii85, with an alphabet chosen so that you can dump the result into a lot of code and delimited document formats (it avoids quotes and backslash, among others).

E.g. safe to use in JSON, XML, and most code

...but not e.g. filenames.


Uses 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#
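
Since only recent Python versions include Z85 in the standard library, here is a hand-written sketch using the alphabet above. The function names are made up for this example, and it assumes the input length is a multiple of four bytes (any padding scheme is left to the application).

```python
Z85 = ("0123456789abcdefghijklmnopqrstuvwxyz"
       "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")

def z85_encode(data: bytes) -> str:
    if len(data) % 4:
        raise ValueError("input length must be a multiple of 4")
    out = []
    for i in range(0, len(data), 4):
        n = int.from_bytes(data[i:i + 4], "big")
        group = []
        for _ in range(5):          # five base-85 digits, least significant first
            n, d = divmod(n, 85)
            group.append(Z85[d])
        out.extend(reversed(group))
    return "".join(out)

def z85_decode(text: str) -> bytes:
    if len(text) % 5:
        raise ValueError("input length must be a multiple of 5")
    out = bytearray()
    for i in range(0, len(text), 5):
        n = 0
        for ch in text[i:i + 5]:
            n = n * 85 + Z85.index(ch)
        out += n.to_bytes(4, "big")
    return bytes(out)

print(z85_encode(bytes.fromhex("864fd26fb559f75b")))
```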

BasE91

Pushing the Base85 idea a little further, for 123% of the original size, over 7-bit channels.

Uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!#$%&()*+,./:;<=>?@[]^_`{|}~" (and notably no - ' \)

MECE (Multiple Escape Character Encoding)

Multiple escape characters in base 52, using 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP for coding, and VWXYZ and QRSTU for escapes.

Capable of Unicode.

URL-safe and filename-safe.

Not very byte-efficient, though.

SREC


yEnc

Used on Usenet for files.


It seems this cares only about being MIME-safe and assumes 8-bit clean transfer (which has been usually true for quite a while now).


Uses non-ASCII values, but still a reduced character set (relies on 'Extended ASCII' codepage, so cannot be mixed with other encodings in the same message).


Noticeably more efficient than MIME's Base64, uuencode, and binhex; in many cases the encoded form can be ~102% of the original.

Not a formal standard, and has some flaws that make wide adoption problematic (and even the given application has a few problems).


Kermit

The Kermit protocol includes a way to send 8-bit data over 7-bit lines.
