Talk:Text coding



Simple to handle, simple to navigate.

  • ASCII: byte-per-character, values 0-127
  • Character sets (codepages): give the values 128-255 a meaning. There are dozens of these.
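
As a small illustration of why the choice of character set matters, the following Python sketch decodes one and the same byte under three codepages. The byte value (0xE9) and the three codepages are picked only for the example, and the variable names are made up here.

```python
# One byte above 127 has no meaning in ASCII; a codepage assigns one.
raw = bytes([0xE9])

# The same byte value decodes to a different character per codepage.
as_latin1 = raw.decode("latin-1")   # Western European
as_cp1251 = raw.decode("cp1251")    # Cyrillic
as_cp437 = raw.decode("cp437")      # original IBM PC codepage

for name, text in [("latin-1", as_latin1), ("cp1251", as_cp1251), ("cp437", as_cp437)]:
    print(f"{name:8} 0x{raw[0]:02X} -> {text!r} (U+{ord(text):04X})")

# Plain ASCII refuses the byte outright.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Without knowing which codepage was used to write the byte, there is no way to tell which of the three characters was meant.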


There were some early encodings that received some library support, but their strings are usually hard to navigate and, more importantly, they were still limited in the characters they defined - they were basically frankensteinian codepages.

Frankly, the only thing that can support all the characters you will need is Unicode, which in the abstract sense is an open-ended assignment of characters (not glyphs, which are a font matter) to codepoints, here a fancy name for 'integers'.

Open-endedness seems to have been the original idea, along with a 32-bit range, but neither survived: Unicode formally ends at codepoint 0x10FFFF. This limit applies to UCS as well as to UTF-encoded strings, since ISO 10646 has been restricted to the same range.
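
To make 'codepoints are integers' and the 0x10FFFF limit concrete, here is a minimal Python sketch. The helper name is_valid_codepoint is invented for this example; ord(), chr() and sys.maxunicode are standard Python.

```python
import sys

# A codepoint is just an integer; ord() and chr() convert both ways.
codepoint = ord("A")          # 65, i.e. U+0041
char = chr(0x20AC)            # the euro sign

# Python's str type covers exactly the Unicode codespace.
highest = sys.maxunicode      # 0x10FFFF


def is_valid_codepoint(n: int) -> bool:
    """Return True if chr() accepts n, i.e. 0 <= n <= 0x10FFFF."""
    try:
        chr(n)
        return True
    except (ValueError, OverflowError):
        return False


print(hex(highest), is_valid_codepoint(0x10FFFF), is_valid_codepoint(0x110000))
```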

Note that there are in fact two standards, Unicode and ISO's 10646, which are kept in sync with each other.

However, Unicode is not itself an encoding. To store codepoints as bytes, you have the option of:

  • UCS-4: Four bytes per codepoint, cutting the options at 2^31-1 as originally defined (likely to include everything, ever).
  • UCS-2: Two bytes per codepoint, cutting the options at 2^16-1 (some dead-language characters fall outside this, as do rarer CJK characters and emoji)
  • UTF-32: Defined after UCS-4, and seems to have become fully equivalent since the 0x10FFFF limit.
  • UTF-16LE, UTF-16BE, UTF-8: p
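
The difference between these options is easiest to see by encoding a few characters and counting bytes. The sketch below uses Python's built-in codecs; the sample characters and the function name encoded_sizes are chosen for this example only. Python has no UCS-2 codec, so UTF-16 stands in for it, which gives the same bytes for any character below 0x10000.

```python
# Sample characters: ASCII, inside the 16-bit range, and outside it.
samples = {"A": "A", "euro": "\u20ac", "emoji": "\U0001F600"}
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le"]


def encoded_sizes(text: str) -> dict:
    """Map each encoding name to the number of bytes text takes in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}


for label, text in samples.items():
    print(f"{label:6} U+{ord(text):06X} {encoded_sizes(text)}")

# A codepoint above 0xFFFF does not fit in 16 bits, so UTF-16 spends
# two 16-bit units (a surrogate pair) on it.
emoji_utf16 = samples["emoji"].encode("utf-16-be")
```

Only the UTF-32 column is constant; UTF-8 and UTF-16 grow with the codepoint value.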

Nitpicker's notes:

  • Numbers in UCS names refer to bytes, those in UTF names to bits. Since this distinction has only limited meaning, a lot of people say UCS16 and UCS32 too.
  • There seems to be no consensus on whether the dashes should be there or not. It seems they are usually present with UCS and absent in UTF, possibly for skimmability.

UCS is fixed-width, while UTF8 and UTF16 are variable-width (UTF32 is fixed-width as well). Also, UTF is guaranteed to be transportable, UCS is not. That is:

  • UTF8 is exactly defined, byte-order-wise
  • Plain UTF16 does not fix a byte order by itself: the order is either signalled by a byte order mark (BOM) at the start of the data, or fixed by naming UTF16LE or UTF16BE (little and big endian). Same goes for UTF32
  • Technically there are UCS2BE, UCS2LE, UCS4BE and UCS4LE, however:
    • ...since UCS is rarely used to transport data (and when it is, the endianness is exactly defined), most Unicode implementations use platform integers and therefore the machine's endianness, which is the thing that is not binarily transportable.
    • ...which usually means the only byte data they provide is a UTF flavour translated from that UCS.
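
The byte-order points above can be checked with a short Python sketch. It relies only on Python's standard codecs; the variable names are invented for the example, and the behaviour of the unsuffixed "utf-16" codec (BOM plus native order) is Python's choice, not something this page prescribes.

```python
import codecs

text = "A"

# UTF-8 has a single byte sequence per character; byte order never enters.
utf8 = text.encode("utf-8")

# The LE/BE variants pin the byte order and write no byte order mark.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# Python's plain "utf-16" writes a BOM followed by native-order data.
generic = text.encode("utf-16")
has_native_bom = generic.startswith(codecs.BOM_UTF16)

# When decoding plain "utf-16", the BOM tells the decoder which order to use.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")

print(utf8, le, be, generic, from_le, from_be)
```

Data written with an explicit LE or BE codec reads back the same on any machine, which is the transportability meant above.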

More notes:

  • When Windows adopted Unicode, it adopted UCS2 by choosing a short int (16-bit) for the character type. In fact, wchar_t is still 16-bit on C++ .NET, while it is 32-bit on unices/gcc.
  • Some languages, usually older ones, put variable-byte character strings in byte-per-character strings. This is asking for trouble, as you then need string functions per encoding. PHP's general string functions are single-byte (ASCII/codepage), and it provides a semi-useful subset (the mb_* functions) in which you can specify the encoding.
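
The trouble with treating variable-byte text as a byte-per-character string can be sketched in Python, where bytes plays the role of the byte string and str that of a proper character string. The sample word and variable names are chosen for this example only.

```python
text = "na\u00efve"            # 'naive' with a diaeresis on the i
data = text.encode("utf-8")    # the same text as a raw byte string

char_count = len(text)         # counts characters
byte_count = len(data)         # counts bytes; the accented i takes two

# Cutting the byte string after 3 bytes lands in the middle of a character.
truncated = data[:3]
try:
    truncated.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# Cutting the character string first, then encoding, stays valid.
safe = text[:3].encode("utf-8")

print(char_count, byte_count, split_ok, safe)
```

Single-byte string functions make the first kind of cut, which is why length, substring and similar operations need encoding-aware versions.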