Text coding


An oversimplified history of text coding

(rewriting)


Codepage notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

General and/or western

  • ASCII
    • a.k.a. X3.4, IEC 646, International Reference Alphabet (IRA) (defined in ITU-T T.50), IA5 (International Alphabet No.5, the older name for IRA), ECMA-6
    • minor details in revisions (consider e.g. X3.4-1968, X3.4-1977, X3.4-1986)
    • Also sometimes called USASCII to distinguish it from more obscure things with ASCII in their name
    • Single-byte. Assigns 0x00 through 0x7F, leaves 0x80-0xFF undefined


  • Latin1, ISO 8859-1
    • one of about a dozen defined in ISO 8859
    • A codepage that extends ASCII to cover about two dozen European languages, and almost covers perhaps a dozen more.
    • was one of the more common around the DOS era(verify)


  • Windows-1252, also codepage 1252, cp1252, sometimes WinLatin1
    • Default codepage for windows in western europe(verify)
    • a superset of Latin1 (which makes WinLatin1 a potentially confusing name)
    • Defines the 0x80-0x9F range (which in Latin1 was technically undefined but often used for C1 control characters), using it for some printable characters instead, including €, ‹ and › (single angle quotes), „ (German-style low quote), Œ, œ, Ÿ, ž, Ž, š, Š, ƒ, ™, †, ‡, ˆ, ‰, and some others.

Mostly localized (to a country, or one of various used in a country)


Numbered codepages are often shorthanded with cp, e.g. cp437.

Defined by/for OSes



  • Mac OS Roman
    • the most common Macintosh coding up to (and excluding) OS X, which adopted UTF-8
    • Known to IANA under the MACINTOSH name, which can be a little misleading as there are several different Apple codings from about the same time.


  • Windows-1252
    • also known as WinLatin1 (which can be confusing, because it conflicts with Latin1, if only on one point)
    • a superset of the basic ISO/IEC 8859-1 (a.k.a. Latin1) set, though strictly conflicting, in that it uses characters in the C1 set (which are invalid in 8859-1).
  • Windows-1251 / cp1251
    • basically the Cyrillic variant of Windows-1252
  • Windows-1250 / cp1250
    • basically the eastern European variant of cp1252
  • Windows-1253 / cp1253
    • basically the Greek variant of cp1252
  • Windows-1254 / cp1254
    • basically the Turkish variant of cp1252

Semi-sorted

  • The 7-bit character set defined in GSM 03.38 (...section 6.2.1) [1] (sometimes abbreviated 'GSM')



See also (codepages)

Unsorted:

Unicode

tl;dr: Unicode:

  • Unicode itself is a character set, an enumeration (called codepoints) of characters
    • it intends to code characters, not glyphs
  • covers the scripts of all known live languages and many dead ones
  • defines a number of different byte codings to use in transmission

Perhaps unicode's most useful property is the most boring one: that its interpretation does not depend on context or serialization.


A central authoritative list of all the characters there are, with stable numbering for them (codepoints), is wildly useful. It's also the most central thing that unicode is.


Unicode is probably the broadest character map, and a fairly thoughtful one at that.

Unicode has in fact been used to define some new codepages, and has been used to clarify older codepages, particularly their transforms to and from Unicode.


At inception, Unicode was considered 16-bit, and only later made open-ended (in theory; the current cap of U+10FFFF actually comes from that same history, as do some further rough edges).

That cap of U+10FFFF allows ~1.1 million codepoints, which is plenty - we're using only a small portion, and have covered most current and historical languages:

  • approx 110K are usable characters
    • about 70K of those 110K are CJK ideographs
    • approx 50K are in the BMP (Basic Multilingual Plane), 60K in higher planes (verify)
  • an additional ~130K are private use (no character assigned; can be used by local convention, not advised for communicating meaning)
  • approx 900K within the currently-valid range are unassigned
  • so this is likely to last a long while


The terminology can matter a lot (to the point I'm sure some sentences here are incorrect), because once you get into all the world's scripts, you have to deal with a lot of real-world details: adding accents to other characters, ligatures, emoji, right-to-left languages, scripts where on-screen glyphs change depending on adjacent characters.

For example:

  • Characters have meanings (e.g. latin letter a)
  • A glyph has a style
    • Many characters can appear as fairly distinct-looking glyphs - consider the few ways you can write an a (e.g. cursive, block). Exactly the same function, but different looks.
  • also, some scripts will change the glyph used for a character in specific sequences of code points
  • consider languages that never separate specific characters (e.g. ch in Czech)
  • text segmentation, e.g. for cursor movement, sometimes deals with multiple codepoints at once - a 'grapheme cluster' becomes a more useful concept, see e.g. UAX #29

To most of us, the takeaway is that codepoints tend to be semantic things - even if, most of the time and for most of the world, they work out as one glyph put after the other.




What unicode isn't

U+ denotation

A particular Unicode codepoint is basically an integer, typically denoted in the U+ format, which uses hexadecimal and is frequently left-padded with zeroes.


For example, character 32 (a space, taken from ASCII) would be U+20. Or U+0020.


Leading zeroes are ignored when parsed, so how many zeroes to pad with is mostly personal taste/convention. So U+05 is U+0000005 is U+0005 is U+5.
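As a quick Python sketch of that convention (padding to four hex digits is just the common habit):

```python
cp = ord(" ")                # 32
print(f"U+{cp:04X}")         # U+0020

# leading zeroes don't matter when parsing the hex part back
assert int("0005", 16) == int("5", 16) == 5
```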


Code units, codepoints, scalar value

In general use (more so when working on the unserialized form of unicode, or thinking we are), we tend to just say 'codepoint', and intuitively mean what are technically scalar values.


In UTF-8 the serialized and unserialized form are distinct enough that there's no confusion.


In UTF-16, confusion happens because:

  • for approx 60K characters, code units code for codepoints directly
  • for the other ~50K codepoints, they don't
  • but particularly English speakers deal primarily with that first set


Code units aren't a uniquely UTF-16 concept, but are the most interesting in UTF-16.

Because UTF-16 had to become a variable-width serialization, it introduced code units that are not valid codepoints and are only used to indirectly code for other codepoints.

For example, the sequence U+D83E U+DDC0 is a surrogate pair - two UTF-16 code units that code for the single U+1F9C0 codepoint. You will never see U+D83E or U+DDC0 by itself in a UTF-16 string. Not in a well-formed one, anyway.


But nobody uses that terminology quite that precisely, and we often use the notation we generally associate with codepoints for UTF-16 code units as well, which causes confusion as it blurs the lines between encoded form and 'pure', scalar-value codepoints.

And you will run into this a lot in programming.
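As a minimal sketch in Python, serializing that exact example to UTF-16 shows the two code units:

```python
s = "\U0001F9C0"                 # one codepoint, U+1F9C0
units = s.encode("utf-16-be")    # serialize to UTF-16 (big-endian, no BOM)

# split the bytes back into 16-bit code units: 0xd83e, 0xddc0
print([hex(int.from_bytes(units[i:i+2], "big")) for i in (0, 2)])
```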



More technically

Relevant terms - see https://www.unicode.org/glossary/

  • Code Point[2] - Any value in the Unicode codespace; i.e. the range of integers from 0 to 0x10FFFF.
  • Code Unit[3] - "minimal bit combination that can represent a unit of encoded text for processing or interchange"
e.g. UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, UTF-32 uses 32-bit code units
  • Scalar value[4] - all codepoints except surrogates.

Storing, altering

Say you now know that U+2164 (Ⅴ) is roman numeral V - how do you communicate it?


A lot of document formats opt for UTF8 or UTF16, and are either defined to always use that, or mark what they use within the document.


Some cases have further options. For example, most HTML and most XML allow three options: numeric entities like &#233; or &#xE9;, named entities like &eacute; (though that's a smallish list, and XML itself doesn't define them), and raw bytes according to the document encoding - e.g. that example would be 0xC3 0xA9 in UTF8.


Dealing with unicode in programming is a more complex topic.

In-memory, UCS4 (just the codepoints as plain integers) is simplest to deal with, but there are practical reasons (storage space) and historical reasons (like Windows APIs initially being UCS2, so retrofitting to UTF-16 was easier) why some implementations opt for UTF-16 instead.

UTF-16 makes operations that alter unicode strings more complex - even just counting codepoints means decoding the string. But it's doable, and done. And you'd probably be using libraries anyway.
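A small Python illustration of that counting point (Python strings behave as codepoints, so decoding does the work):

```python
text = "a\U0001F9C0"               # U+0061 and U+1F9C0

utf16 = text.encode("utf-16-le")   # 6 bytes = 3 code units
print(len(utf16) // 2)             # 3 code units - not the character count
print(len(utf16.decode("utf-16-le")))  # 2 codepoints, found by decoding
```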


And most operating systems and most programming languages worth mentioning support Unicode at some level, relieving programmers from having to be obsessive-compulsive experts to get it right. Still, you have to know approximately what you're doing.




UCS

UCS (in itself, without a number) refers to the Universal Character Set - abstractly to the character set that Unicode defines.


In programming you more usually see references to UCS2 or UCS4, ways of storing non-encoded codepoints in fixed-size elements (16-bit and 32-bit uints). (and from the perspective of bytes, you are using one of UCS2BE, UCS2LE, UCS4LE, or UCS4BE)

Unicode library implementations often use an array of integers in the machine's native endianness (for speedy operations and simplicity), where each integer is either 2 bytes (UCS2) or 4 bytes (UCS4) - the latter supports all of unicode, the former only the BMP.

Various common operations are much simpler in this format than in encoded data (than in UTF), including finding the number of characters, moving bidirectionally within a string, comparing two strings, taking a substring, overwriting characters, and such.

Yes, you could back an implementation with e.g. UTF16 instead, which is more memory-efficient, but the logic is a little slower, and much hairier. But it's doable, and done in practice.


UTF

UTF (Unicode Transformation Format) refers to various variable-length encodings of unicode.

The two most common flavours are UTF-8 and UTF-16 (Others exist, like UTF-32 and UTF-7, but these are of limited practical use).


Because this is a variable-bytes coding, only left-to-right movement in these strings is simple. Another implication is that various operations (see above) are more work, and often slower.


Note that UTF-16 is a superset of UCS2; UTF-16 only really adds surrogate pairs.

So it is backwards compatible -- but creates some footnotes in the process.

Windows before 2000 used UCS2, Windows 2000 and later use UTF-16 to support all characters and break (almost) no existing code. That is, interpreting UTF-16 as UCS2 won't fail, it just limits you to ≤U+FFFF.

Older software seeing surrogates would just discard them (or render them as U+FFFD), treating them as non-allocated characters. They wouldn't display, but you can't expect that from software predating these character ranges.



Limits

Note that while UTF-8 as designed could code codepoint values up to 2^31, UTF-16 can only code about 2^20 codepoints beyond the BMP with its limited number of surrogates, which is part of the reason for the codepoint cap at U+10FFFF (and it is not likely to move for a while, as UTF-16 is at the core of a bunch of Unicode implementations).

  • UCS2 is limited to U+FFFF
  • UTF-16 is limited to U+10FFFF (only so many surrogate pairs)
  • UTF-8, UCS4, and UTF-32 could go up to at least 2^31 in theory, but currently everything keeps under the overall cap imposed via UTF-16



Space taken

UTF-8 uses

  • 1 byte up to U+7F
  • 2 bytes for U+80 through U+7FF
  • 3 bytes for U+800 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF
  • It was designed to be able to code higher codepoints, with 5-byte and 6-byte sequences, but since we've used no more than ~30% of the space under the current cap, this is unlikely to be needed anytime remotely soon.

Note that:

  • Most ASCII is used as-is in Unicode
    • particularly the printable part, so printable ASCII text is itself valid UTF-8 bytes
    • the control characters (those below U+20) carry over identically too, though they're not really part of printable text
  • ...also meaning english encoded in UTF-8 is readable enough by humans and other Unicode-ignorant systems, and could be edited in most text editors as long as they leave the non-ASCII bytes alone.
  • All codepoints at U+80 and above are coded purely with non-ASCII byte values, and zero bytes do not occur in UTF8 bytes (or most ASCII text), meaning that
    • you can use null-terminated C-style byte strings to store UTF8 strings
    • C library string functions won't trip over UTF-8, so you can get away with carrying strings that way
    • ...just be aware that unicode-ignorant string altering may do bad things
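The byte-length table above is easy to verify in Python:

```python
# byte length per codepoint range, per the table above
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFD, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X} -> {len(chr(cp).encode('utf-8'))} byte(s)")

# printable ASCII passes through unchanged, with no zero bytes
assert "hello".encode("utf-8") == b"hello"
```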


UTF-16 uses

  • 2 bytes for U+0000 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF (by using pairs of surrogates to code above U+FFFF; surrogates as currently defined are also the main reason for that U+10FFFF cap)


UTF-32 stores all characters unencoded, making it superficially equivalent to UCS4 (it isn't quite: UCS is a storage-agnostic enumeration, UTF is a byte coding with unambiguous endianness).


If you want to minimize stored size / transmitted bandwidth, you can choose between UTF-8 and UTF-16 based on the characters you'll mostly encode.

The tl;dr is that:

  • UTF-8 is shortest when you store mostly western alphabetics
    • characters under U+800 are coded with at most 2 bytes, and most western-ish alphabetic languages almost exclusively use codepoints under that - mostly ASCII even, coded in a single byte - so this will typically average less than 2 bytes per character.
  • UTF-16 is sometimes better for CJK and mixed content
    • UTF-16 uses exactly 2 bytes for everything below U+FFFF, and 4 bytes for everything above that point
    • so in theory it'll average somewhere between 2 and 4 bytes per character
    • in practice, the most common CJK characters are in the U+4E00..U+9FFF range, so you may find it's closer to 2 - while in UTF-8 most are 3 or 4 bytes.
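A quick way to check this trade-off for your own data, in Python (the sample strings here are just illustrative):

```python
western = "Déjà vu, encore une fois"   # mostly ASCII, a few accents
cjk = "中文文本的一个例子"              # BMP CJK ideographs

for name, s in (("western", western), ("cjk", cjk)):
    print(name,
          len(s), "codepoints;",
          len(s.encode("utf-8")), "bytes UTF-8;",
          len(s.encode("utf-16-le")), "bytes UTF-16")
```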

On surrogates



Surrogates appear in UTF-16 serialized form, and exist to allow UTF-16 to represent the U+10000 to U+10FFFF range of codepoints.

Surrogates are values in the range 0xD800–0xDFFF (2048 values)


These values should appear only in UTF-16 serialized form.

They should not appear in UCS (i.e. those values are not valid codepoints) -- but the fact that some unicode implementations are UTF-16 rather than UCS4 can make that distinction somewhat vague and confusing in practice.

They should not appear in other coded forms either. For example, UTF-8 could mechanically code these values, but because they shouldn't appear in UCS, UTF-8 specifically shouldn't ever deal with them, and UTF-8 encoders and decoders should give errors rather than pass them through. (This is good behaviour, because double-encoding UTF codings is probably never a good idea.)


Surrogate values are split into two ranges of 1024:

  • High surrogates are in the range 0xD800 .. 0xDBFF
  • Low surrogates are in the range 0xDC00 .. 0xDFFF


Surrogates are only valid in pairs, where the first must always come from the high range and the second always from the low range.

It's relatively simple math combining some bits. Each half of a surrogate pair contributes 10 bits, so together they code a 20-bit value (approx a million), and since this is offset above 0xFFFF, surrogates can represent 0x10000 through 0x10FFFF(verify) - which is most of the reason for the current cap at U+10FFFF.
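That bit math, as a small Python sketch (the function name is mine, not from any library):

```python
def to_surrogate_pair(cp):
    """Split a codepoint in U+10000..U+10FFFF into UTF-16 surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # a 20-bit value
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F9C0)])  # ['0xd83e', '0xddc0']
```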


The "pair is high then low" lets us detect data that doesn't seem to be UTF-16 at all, but more importantly, invalid use of surrogates.

And yes, that does happen. For example, Windows's APIs don't check for well-formed UTF-16, so will happily pass through ill formed UTF-16 (e.g. from filesystem filenames) that happened to make it through APIs before.

(UTF-16 decoding should throw away lone surrogates that you see in encoded UTF-16, because that's ill-formed unicode. This is fairly simple code due to the way they're defined: every value in 0xDC00 .. 0xDFFF without a value in 0xD800 .. 0xDBFF before it should go away, every value in 0xD800 .. 0xDBFF without a value in 0xDC00 .. 0xDFFF after it should go away)
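A sketch of that cleanup over a list of 16-bit code units (a hypothetical helper, not from any library):

```python
def drop_lone_surrogates(units):
    """Keep only well-paired surrogates; drop lone highs and lows."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:    # high: only valid followed by a low
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out += [u, units[i + 1]]
                i += 2
            else:
                i += 1               # lone high surrogate: drop
        elif 0xDC00 <= u <= 0xDFFF:  # low without preceding high: drop
            i += 1
        else:
            out.append(u)
            i += 1
    return out
```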




Notes:

  • The list of unicode blocks has high surrogates split into regular and "private use surrogates"
    • this seems to just reflect that the last ~128 thousand of the ~million codepoints that UTF-16 surrogates can represent fall in planes 15 and 16, Private Use Areas A and B


  • One annoying side effect is that any program which absolutely must be robust against invalid UTF-16 will need its own layer of checking.
    • In theory, you can ignore this, because someone else has a bad implementation and them fixing it is best for everyone.
    • In practice, erroring out on these cases may not be acceptable.


  • Remember that "UTF-8 should not contain surrogates, lone or otherwise"? Yeah, some implementations deviate.
    • For example, java uses a slight (but incompatible) variant of UTF-8 called "Modified UTF-8"[5][6], used for JNI and for object serialization(verify).
  • WTF-8 takes the view that it can be more practical to work around other people's poor API implementations than to go "technically they broke it, not my problem, lalalala"
    • it allows representation of invalid UTF-16 (from e.g. windows) so that it can later hand back the same invalid string



More technical notes

Unicode General Category

Unicode states that each codepoint has a general category (a separate concept from bidirectional category, which is often less interesting)


The general class is capitalized, the detail is in an added lowercase letter.


Letters (L):

  • Lu: Uppercase
  • Ll: Lowercase
  • Lt: Titlecase
  • Lm: Modifier
  • Lo: Other

Numbers (N):

  • Nd: Decimal digit
  • Nl: Letter (e.g. roman numerals like Ⅷ)
  • No: Other (funky scripts, but also subscript numbers, bracketed, etc)

Symbols (S):

  • Sm: Math
  • Sc: Currency
  • Sk: Modifier
  • So: Other

Marks (M):

  • Mn: Nonspacing mark
  • Mc: Spacing Combining mark
  • Me: Enclosing mark

Punctuation (P):

  • Pc: Connector
  • Pd: Dash
  • Ps: Open
  • Pe: Close
  • Pi: Initial quote (may behave like Ps or Pe depending on usage)
  • Pf: Final quote (may behave like Ps or Pe depending on usage)
  • Po: Other

Separators (Z):

  • Zs: Space separator,
  • Zl: Line separator,
  • Zp: Paragraph separator,

Other (C):

  • Cc: Control
  • Cf: Format
  • Cs: Surrogate
  • Co: Private Use
  • Cn: Not Assigned (no characters in the unicode.org list explicitly have this category, but implementations are expected to return it for every codepoint not in that list)
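In Python, the unicodedata module exposes these categories directly:

```python
import unicodedata

print(unicodedata.category("a"))       # Ll - lowercase letter
print(unicodedata.category("Ⅷ"))      # Nl - letter-number (roman numeral)
print(unicodedata.category("€"))       # Sc - currency symbol
print(unicodedata.category("\u0301"))  # Mn - nonspacing mark (combining acute)
```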


Pay attention to unicode version, since there are continuous additions and corrections. For example, older versions may tell you all non-BMP characters are Cn. I believe Pi and Pf were added later, so beware of testing for too few lowercase variations.


On modifiers and marks: TODO

Normal forms



Unicode's normal forms are mainly meant to compare codepoints as meaning the same thing (semantically and/or visually), even if they are not the same codepoints.


Unicode defines two types of equivalence: canonical equivalence, and compatibility equivalence.


Canonical equivalence is relatively strict semantic equivalence, used to equate things that may show up in more than one way. (This usually also implies both will be rendered identically.)

Consider é. It can show up in both of the following ways:

  • é is a LATIN SMALL LETTER E WITH ACUTE (U+00E9)
  • é looks the same but is actually two codepoints:
    • LATIN SMALL LETTER E (U+0065) followed by
    • COMBINING ACUTE ACCENT (U+0301)

As far as canonical equivalence is concerned, these two are interchangeable.


Compatibility equivalence is for things you might want to reduce to the same thing, but which are not interchangeable.

Now consider the roman numeral Ⅳ (as a single character, U+2163).

In some contexts it may be rather useful to be able to consider it the same as the basic-two-characters string "IV". And in other contexts, specifically not.

Note that:

  • since compatibility equivalence is more of a fuzzy 'does this look about the same', anything that is canonically equivalent is implicitly also compatibility equivalent.
Not the other way around, obviously, which leads to the following point
  • canonical equivalence is a symmetric relation, and compatibility is not.
It is, for example, not at all a general truth that every sequence of IV should become the roman numeral.



Doing tests, and/or keeping the normalized form


Canonical

Considering the above, and the following python3 code:

import unicodedata

# canonical decomposition: é becomes e + combining acute
unicodedata.normalize("NFD", "\u00e9")  == 'e\u0301'

# canonical composition: e + combining acute becomes é
unicodedata.normalize("NFC", 'e\u0301') == '\xe9'

Since this uses only canonical changes, they retain semantic information, which makes it useful for:

  • storing in a directly usable and directly comparable way (so that you don't have to decompose each time you compare)
  • sorting (think collations)
  • text indexing (sort of like how you might lowercase the query and the text you index, to do case-insensitive search)
  • operations like removing accents
  • operations for compatibility - e.g. the halfwidth and fullwidth forms include a copy of the basic latin alphabet, mostly for lossless conversions between this and older encodings containing both halfwidth and fullwidth characters (verify)


What they actually do is

  • NFD (Normalization Form Canonical Decomposition) decomposes only to canonically equivalent forms. Examples:
    • Roman numeral character Ⅳ stays Ⅳ
    • e-acute becomes separate e and combining acute
  • NFC (Normalization Form Canonical Composition) decomposes canonically, then composes canonically. Examples:
    • Roman numeral character Ⅳ stays Ⅳ
    • separate e and combining acute becomes e-acute


Compatibility

Mixing in compatibility equivalence has different uses, including:

  • things that are equivalent enough, but not semantically
    • for example, there is a ℃ character (U+2103), which has a compatibility decomposition into ° (U+B0) and the letter C. It wouldn't make sense to put that into canonical equivalence at all.
  • people not knowing about specific codepoints, or practical laziness about it
    • like people typing IV when they mean roman numeral Ⅳ; with compatibility transforms you can test them for equivalence via the string "IV", which is meaningful in contexts where you know it will be a number.
    • you wouldn't want to alter a document with a compatibility-based transform, though, as that is often (but not always) destructive in the semantic sense.
  • allowing for even fuzzier search, in that it includes more things that look similar
    • though note that it often isn't quite fuzzy enough for that use
    • say, "Röyksopp" would be indexed as "Royksopp", but "Røyksopp" would not, because there is no reason to make ø equivalent to o-and-something (the confusables data mentions this relation - but that's a different thing)
    • similarly, you might expect the German ß (eszett) to be compatibility equivalent to 'ss', but it is not (the confusables data mostly makes it equivalent to beta)


The conversions aren't quite like the canonical variants. Consider:

  • NFKD (Normalization Form Compatibility Decomposition) decomposes to compatibility-equivalent forms. Examples:
    • Roman numeral character Ⅳ becomes two-letter "IV" (unicodedata.normalize("NFKD", "\u2163") == 'IV')
    • e-acute becomes separate e and combining acute
  • NFKC (Normalization Form Compatibility Composition) decomposes to compatibility-equivalent forms, then recomposes canonically. Examples:
    • Roman numeral character Ⅳ becomes "IV", then stays "IV"
    • separate e and combining acute becomes e-acute

Note that there is no compatibility composition - it would make way too many semantically nonsensical combinations. For example, it doesn't make sense for every "IV" to become the roman numeral Ⅳ.


Planes

Currently, the range open for plane and character definitions is capped at U+10FFFF, which the definition of UCS divides into seventeen consecutive planes of 65536 (0x10000) codepoints each.

  • BMP (Plane 0, the Basic Multilingual Plane) stores scripts for most any live language (most of U+0000 to U+FFFF). UCS2 was created to code just this plane.
  • SMP (Plane 1, the Supplementary Multilingual Plane) is mostly used for historic scripts (ranges inside U+10000 to U+1FFFF)
  • SIP (Plane 2, the Supplementary Ideographic Plane) stores more Han-unified ideographs (ranges inside U+20000 to U+2FFFF)
  • SSP (Plane 14, the Supplementary Special-purpose Plane) contains a few nongraphical things, like language markers (a range in U+E0000 to U+EFFFF)
  • Planes 3 to 13 are currently undefined.
    • Plane 3 has tentatively been called the Tertiary Ideographic Plane, but (as of this writing) has only tentatively allocated characters
  • Plane 15 (U+F0000 to U+FFFFF) and plane 16 (U+100000 to U+10FFFF) are Private Use Areas A and B, which you can use for your own font design, non-transportable at least in the semantic sense.
    • (There is also a private use range in the BMP, U+E000 to U+F8FF)

Byte Order Mark (BOM)

In storage you often deal with bytes.


UTF16 is 16-bit ints stored into bytes, so endianness becomes a thing.

It's useful to standardize any file serialization to a specific endianness, or store the overall endianness, but you don't have to: you can optionally start each UTF-16 string with a Byte Order Mark (BOM).

The BOM is character U+FEFF, which once serialized to bytes works out as either:

  • FE FF: means big-endian UTF-16 (UTF-16BE)
  • FF FE: means little-endian UTF-16 (UTF-16LE)
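A minimal sniffing sketch in Python (the function name is mine, for illustration):

```python
def sniff_utf16_bom(data: bytes):
    """Report UTF-16 endianness from a leading BOM, if present."""
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    return None  # no BOM; endianness must be known out-of-band

print(sniff_utf16_bom("\ufeffhi".encode("utf-16-be")))  # utf-16-be
```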


Notes:

  • U+FEFF was previously also used as a zero-width non-breaking space, so could also appear in the middle of a string.
    • To separate these functions, that use of U+FEFF is deprecated, and the zero-width non-breaking space role is now taken by the word joiner, U+2060.

  • U+FFFE is defined never to be a character, so that the above test always makes sense
  • Because BOMs are not required to be handled by all Unicode parsing, it is conventional that:
    • The BOM character is typically removed by UTF-16 readers
    • If a file/protocol is defined to be a specific endianness, it will generally not contain the BOM (and may be defined to never use it)



UTF-32 also has BOMs, though they are rarely used(verify):

  • 00 00 FE FF: big-endian UTF-32
  • FF FE 00 00: little-endian UTF-32

(Note that if you don't know whether data is UTF-32 or UTF-16 (which is bad practice, really), the little-endian UTF-32 BOM is indistinguishable(verify) from a little-endian UTF-16 BOM followed by a NUL, U+0000 - unlikely, but valid.)



UTF-8 is a byte coding, so there is only one order it can have in byte storage.

You do occasionally find the BOM in UTF-8 (in encoded form, as EF BB BF). This has no bearing on byte order since UTF-8 is a byte-level coding already. Its use in UTF-8 is mostly as a signature, also to make a clearer case for BOM detectors, though it is usually pointless, and accordingly rare in practice.

Other Unicode encodings

Since Unicode is its own definition and relatively new, mostly only UTF and other encodings created after Unicode was defined can encode Unicode characters. One exception is GB 18030, a Chinese standard which can encode all Unicode planes(verify).


Aside from UTF-8, UTF-16, arguably UTF-32, and GB18030, most of these serve a specific purpose, including:


Compressed formats:

  • SCSU (Standard Compression Scheme for Unicode) is a coding that compresses Unicode, particularly if it only uses one or a few blocks.
  • BOCU-1 (Binary Ordered Compression for Unicode) is derived from SCSU

These have limited applicability. They are not too useful for many web uses, since for content larger than a few kilobytes, generic compression algorithms compress better.

SCSU cannot be used in email messages because it allows characters that should not appear inside MIME. BOCU-1 could, but would require wide client support and is unlikely to overtake UTF.


Codings for legacy compatibility:

  • UTF-7 does not use the highest bit and is safe for ancient 7-bit communication lines
  • UTF-EBCDIC


(TODO: look at:)

  • CESU-8
  • Punycode

More practical notes

Note: Some of this is very much my own summarization, and may not apply to the contexts, systems, pages, countries and such most relevant to you.


On private use

On Asian characters

Beyond plain visible characters


There are a few dozen codepoints officially labeled <not a character>, which don't code anything. This e.g. includes

  • the last two codepoints in each plane
    • e.g. U+FFFE and U+FFFF on the BMP; keeping U+FFFE a non-character is what makes BOM detection work
  • U+FDD0..U+FDEF
apparently "to make additional codes available to programmers to use for internal processing purposes" [7]. You can think of this as a different type of private use
unrelated to the Arabic Presentation Forms-A range it is in, it just had space unlikely to be used


There are a lot of other things that are allocated codepoints but are not visible glyphs, or do not contribute to glyphs by themselves.

(This is not unicode's own classification, just my own overview.)

  • Control characters (definitions from before Unicode)
    • 'C0 range', referring to the ASCII control characters U+00 to U+1F
    • 'C1 range', U+80-U+9F.
    • U+7F (DEL) is usually also considered a control character.
  • Surrogates (U+D800 to U+DFFF, in two blocks)
    • seen in isolation, they code no character (and make the string ill-formed)
    • their real purpose is to code higher-than-BMP codepoints in UTF-16 (including various UTF16-based Unicode implementations)


  • Variations
    • Variation Selectors, U+FE00 to U+FE0F - used to select standardized variants of some mathematical symbols, emoji, CJK ideographs, and 'Phags-pa [8]
    • Variation Selectors Supplement, U+E0100 to U+E01EF [9]
      • see the Ideographic Variation Database[10][11]
    • short summary: Emoji uses VS16 (U+FE0F) (and sometimes VS15, U+FE0E, to force non-emoji representation); non-CJK sometimes uses VS1 (U+FE00); CJK uses VS1, VS2, VS3; VS4 through VS14 are not in use (as of Unicode 11)


  • Tags (the U+E0000 range)
    • Originally intended to tag languages in text (see also RFC 2482); this was deprecated from Unicode 5 on
    • Unicode 8 and 9 cleared the deprecated status (for all but U+E0001) to allow some other use in the future
    • https://en.wikipedia.org/wiki/Tags_(Unicode_block)


Combining characters - visible, but need another character to go on


See also:

Double-up encoding: Â, Ã, etc.


When you take text that was already encoded into bytes, then treat those bytes as codepoints (which happen to all be in the U+00 .. U+FF range), then encode that using UTF-8 (or something else), you get mangled data.


There are many variations on that theme, but the most common are probably:

  • UTF-8 encoding Latin1/CP-1252 byte data, and
  • UTF-8 encoding UTF-8 byte data.


You'll often see character sequences starting with Â (U+C2) or Ã (U+C3), followed by some other accented latin character or some symbol.

This is because many western codepoints lie in the U+0080 through U+07FF range (and mostly near the start of it), which encode to two-byte UTF-8 sequences whose first byte is 0xC2 through 0xDF.

If those bytes are presented via a codepage, where every byte is a character, then you end up with Â, Ã, Ä, Å, Æ, Ç, È, É, through ß (these happen to also be U+C2, U+C3, etc. because Unicode chose to adopt Latin1's characters - which cp1252 shares in this range - for much of U+A0..U+FF).
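A minimal Python sketch of this mangling, assuming cp1252 as the codepage: encode a character as UTF-8, then (wrongly) interpret those bytes as cp1252:

```python
# Encode a character as UTF-8, then misread those bytes as cp1252.
text = "é"                          # U+00E9
utf8_bytes = text.encode("utf-8")   # b'\xc3\xa9'
mangled = utf8_bytes.decode("cp1252")
print(mangled)                      # é  (0xC3 -> Ã, 0xA9 -> ©)
```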


There are a few cases you'll see more often. For example, various word processors will rewrite apostrophes and closing quotes as ’ (U+2019, RIGHT SINGLE QUOTATION MARK) and similar, which is e2 80 99 in UTF-8, which reads as â€™ when interpreted as CP-1252

http://www.i18nqa.com/debug/utf8-debug.html


In a good number of cases you can un-mangle this.

It involves some detection of what the first encoding was, though.
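A sketch of that un-mangling in Python, assuming the first encoding was cp1252 (real tools, such as the ftfy library, do more careful detection). The try/except matters because cp1252 leaves a few bytes undefined, so not every string fits the pattern:

```python
def unmangle(mangled):
    """Try to reverse UTF-8-bytes-read-as-cp1252 mojibake."""
    try:
        return mangled.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return mangled  # didn't fit the pattern; leave as-is

print(unmangle("Ã©"))    # é
print(unmangle("â€™"))   # ’
```

Note that some valid text (e.g. plain ASCII) round-trips unchanged, so applying this blindly is only safe-ish; heuristics that check for the telltale 0xC2-0xDF lead bytes first are more robust.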


Emoji

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Types of emoticons / emoji (wider view than unicode)

Short history:

  • Emoticons originally mostly referred to combinations of ASCII, mostly simple smilies like
    :)


  • ...plus other characters we already had (unicode or not), most of them local ones, e.g.
Japanese kaomoji which started as ASCII like (*_*) (also seen in the west) but now have variants like (´。• ω •。`)
Korean style (like ㅇㅅㅇ),
down to patterns just seen in one place more than another, like ó_ò in Brazil, ಠ_ಠ initially on forums, etc.


  • some sites, mostly message boards, would replace specific text (like
    :)
    ) with images.
Usually just the simpler ASCII smilies, though some were more extensive
  • The same boards are also known for adding further shortcodes, specific strings like
    :bacon:
    as unique text markers to always replace with images
more flexible, sometimes board-specific - and for that same reason less widespread


  • Later, a decent number of well-used emoji were added to Unicode (most after 2010)
which would render as-is, assuming you have a font - however...
various sites/devices do image replacement for these Unicode characters, to get color and their specific style,
to the point that this is part of their text editors


Wider support of emoji in Unicode made things more interesting (for a few different reasons).

Image replacement of Unicode was initially more often HTML-level replacement of a character with an img tag. These days the image replacement is often more integrated, where an app, a phone input field, and/or a browser(verify) shows the replacement as you type.

The same browsers/apps will also often provide decent emoji keyboards/palettes to make it much easier to enter them in the first place.

This kind of support is a fairly recent development (see e.g. http://caniemoji.com/), and getting pretty decent.

You can consider this more native support of emoji -- except it is also somewhat misleading. What gets sent is not the image you see but the underlying Unicode. And because there are now at least a dozen sets of images, it may look different in another app.

More on image replacement

Apps can choose to replace creative-use-of-character emoji, shortcodes, and/or Unicode emoji.

More recently this has mostly shifted to Unicode emoji (probably because of the input palettes).


Apps often have their own set of image replacements. The number of characters any of these replace varies.


There are also a good number of apps that either use the platform-native set, or specifically adopted one of these.


Emoji sets include:

Google [12] [13]
Microsoft [14]
Apple [15]
Samsung [16]
HTC [17]
LG [18]
WhatsApp [19]
Twitter [20], also used by e.g. Discord
Snapchat [21]
Facebook [22]
Mozilla [23]
EmojiOne [24] - licensable set used e.g. by Discourse, Slack
GitHub [25]
GMail - had a smaller set (used in subject lines, often square-looking)


Emoji according to unicode

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

While a bunch of emoticon-like characters have been around for a long time, the first bulk introduction of Emoji characters came in Unicode 6.0 (~2010).


Standardized Emoji data came later with Emoji 1 (~2015) which started its own versioning (1, 2, 3, 4, 5, 11, 12, 13 - the version jump comes from a choice to synchronize Emoji version with Unicode version).



Note that brand-new emoji often do not have a representation in browsers/devices for a little while after introduction (and font support generally lags much more).


See also:


Flags

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Flags do not have singular codepoints - this would be slightly awkward to keep up to date due to changing geopolitics.


When you restrict yourself to countries, things stay relatively simple.

Unicode 6 added a range of letters used only as regional indicators (U+1F1E6 through U+1F1FF, within the Enclosed Alphanumeric Supplement).

It comes down to "use these in pairs to refer to ISO 3166-1 alpha-2 code"[26].

So, for example, 🇧🇪 is actually the sequence of

U+1F1E7 REGIONAL INDICATOR SYMBOL LETTER B
U+1F1EA REGIONAL INDICATOR SYMBOL LETTER E
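A Python sketch of that mapping: an ISO 3166-1 alpha-2 code becomes a regional-indicator pair by offsetting each letter from 'A' to U+1F1E6:

```python
def flag(alpha2):
    """Turn an ISO 3166-1 alpha-2 code into a regional indicator pair."""
    assert len(alpha2) == 2 and alpha2.isalpha()
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in alpha2.upper())

print(flag("BE"))  # U+1F1E7 U+1F1EA, usually rendered as the Belgian flag
```

Note this only validates shape, not whether the two letters are an assigned country code; unassigned pairs typically render as the two boxed letters rather than a flag.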

Doing it this way means

there's just one authoritative place to keep up to date.
changes in countries are up to implementations, not definitions within Unicode.
...flags can show up differently in a lot of places.



One practical question is "where do you stop with regions?"

The above is an answer, but not a fully satisfying one, shown e.g. by the UK.

The UK is complex to start with. Sure, in the above list, GB is the United Kingdom of Great Britain and Northern Ireland, and its flag is 🇬🇧

But this is one example where the parts of that state are self-governing enough that they certainly use their own flags, which they also had before the union. England, Scotland, and Wales are not countries according to that list. Yet the Isle of Man is on it, while e.g. Tristan da Cunha is not. Why one region the UK is responsible for gets treated so differently from another will be lost on most people.