Text coding


An oversimplified history of text coding

🛈 I am ignoring a bunch of history that I am aware of, and probably more that I am not aware of

The point here is not completeness, but

to give you some context,
a "this is how we used to do it, and this is why we (mostly) stopped doing it that way",
...and then skip to a view for modern programmers, probably the only people who would really care much about these details to start with.

If you're interested in the details, there are better and longer histories, written by more knowledgeable people.




Codepage notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

General and/or western

  • ASCII
    • a.k.a. X3.4, IEC 646, International Reference Alphabet (IRA) (defined in ITU-T T.50), IA5 (International Alphabet No.5, the older name for IRA), ECMA-6
    • minor details in revisions (consider e.g. X3.4-1968, X3.4-1977, X3.4-1986)
    • Also sometimes called USASCII to distinguish it from more obscure things with ASCII in their name
    • Single-byte. Assigns 0x00 through 0x7F, leaves 0x80-0xFF undefined


  • Latin1, ISO 8859-1
    • one of about a dozen defined in ISO 8859
    • A codepage, in that it extends ASCII to cover about two dozen European languages, and almost covers perhaps a dozen more.
    • was one of the more common around the DOS era(verify)


  • Windows-1252, also codepage 1252, cp1252, sometimes WinLatin1
    • Default codepage for Windows in western Europe(verify)
    • a superset of Latin1 (which makes WinLatin1 a potentially confusing name)
    • Defines the 0x80-0x9F range (which in Latin1 was technically undefined but often used for C1 control characters), using it for some printable characters instead, including €, ‹, ›, „, Œ, œ, Ÿ, ž, Ž, š, Š, ƒ, ™, †, ‡, ˆ, ‰, and some others.
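
A quick way to see that difference (a minimal Python sketch; the byte values are just examples):

# 0x80-0x9F decoded as Latin1 gives (invisible) C1 control characters;
# Windows-1252 maps most of those bytes to printable characters instead.
data = bytes([0x80, 0x93, 0x94, 0x99])
print(data.decode('latin-1'))   # four C1 controls - nothing visibly printable
print(data.decode('cp1252'))    # '€""™'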

Mostly localized (to a country, or one of various used in a country)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Numbered codepages are often shorthanded with cp, e.g. cp437.

Defined by/for OSes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

The list is much longer, but some better known names include:


  • Mac OS Roman
    • the most common Macintosh coding up to (but not including) OS X, which adopted UTF-8
    • Known to IANA under the MACINTOSH name, which can be a little misleading as there are several different Apple codings from about the same time.


  • Windows-1252
    • also known as WinLatin1 (which can be confusing, because it conflicts with Latin1, if only on one point)
    • a superset of the basic ISO/IEC 8859-1 (a.k.a. Latin1) set, though strictly conflicting, in that it uses characters in the C1 set (which are invalid in 8859-1).
  • Windows-1251 / cp1251
basically the Cyrillic variant of Windows-1252
  • Windows-1250 / cp1250
basically the Eastern European variant of cp1252
  • Windows-1253 / cp1253
basically the Greek variant of cp1252
  • Windows-1254 / cp1254
basically the Turkish variant of cp1252

Semi-sorted

  • The 7-bit character set defined in GSM 03.38 (...section 6.2.1) [1] (sometimes abbreviated 'GSM')



See also (codepages)

Unsorted:

Unicode

What unicode is

tl;dr: Unicode:

  • Unicode itself is a character set, a map of codepoints to characters.
intends to code characters, not glyphs (see also what unicode isn't, below)
  • ...plus a bunch of cases where codepoints combine to show fewer characters - see also grapheme clusters
  • Unicode covers the scripts of all known live languages, many dead ones, and a number of special purposes
(it's starting to get into semiotic questions - what sort of signs are understood even if they're not language, what do we do with fictional languages, etc.)
  • also, unicode defines a number of different byte codings to use in transmission
Perhaps unicode's most useful property is the most boring one: that its interpretation does not depend on context, or serialization.



More practically

So practically it's a stable list of all the characters there are.

Unicode is probably the broadest character map, and a fairly thoughtful one at that (nothing's perfect - language isn't free of politics, for one, Asian characters and Han unification have had hiccups, emoji seem to have a somewhat random bar for entry, and there are various other points of argument. That said, nothing else comes even close to doing this job well without also being extremely local).

Unicode has in fact been used to define some new codepages, and to clarify older codepages, particularly their transforms to and from Unicode.



More technically

On inception, Unicode was considered 16-bit, and only later made to be open-ended in theory, though there is a current cap at U+10FFFF (this and some further rough edges actually come from the same history).

That cap of U+10FFFF allows 1.1 million codepoints, which is plenty - we're using only a small portion, and have covered most current and historical languages:

approx 110K are usable characters
about 70K of those 110K are CJK ideographs
approx 50K are in the BMP (Basic Multilingual Plane), 60K in higher planes (verify)
an additional ~130K are private use (no character/meaning assigned, and none will be; they can be used by local convention -- though generally not advised for communicating anything with meaning)
approx 900K within the currently-valid range are unassigned

...and that's after we've covered most languages we know of, so this is likely to last a long while.


The terminology can matter a lot (to the point that I'm fairly sure some sentences here are incorrect), because once you get into all the world's languages' scripts, you have to deal with a lot of real-world details, from adding accents to other characters, to ligatures, to emoji, to right-to-left languages, to scripts where on-screen glyphs change depending on adjacent characters. (Some of that is squarely the domain of fonts and not of characters, but Unicode design sometimes still has to be aware of it.)

For example:

  • Characters - have meanings (e.g. latin letter a)
  • A glyph has a style.
Many characters can appear as fairly distinct-looking glyphs - consider the few ways you can write an a (e.g. cursive, block). Exactly the same function, but different looks.
  • also, some scripts will change the glyph used for a character in specific sequences of code points
  • consider languages that never separate specific characters (e.g. ch in Czech)
  • the concept of text segmentation, e.g. for cursor movement - where a single step is sometimes multiple codepoints - makes a 'grapheme cluster' the more useful concept, see e.g. UAX #29.


To most of us, the takeaway from that is that codepoints tend to be semantic things - even if most of the time and for most of the world they work out as one glyph put after the other.


On top of the main standard, there are also

UAX (Unicode Standard Annex) are basically parts of the standard that happen to be described in separate documents
for example bidirectional behaviour is a subject useful to detail on its own
UTS (Unicode Technical Standards) are basically things that implementations can optionally conform to,
e.g. collation, some security considerations, some of the emoji processing, a compressed encoding
UTR (Unicode Technical Reports) are further informative material.
some of these are just helpful,
others are closer to UTSes - for example, Emoji (TR #51) started as a UTR and is now a UTS

(unicode.org has more on their status, and a list of them)


What unicode isn't

U+ denotation

A particular Unicode codepoint is basically an integer, and typically denoted using the U+ format that uses hexadecimal, and frequently seen left-padded with zeroes.


For example, character 32 (a space, taken from ASCII) would be U+20, or U+0020.


Leading zeroes don't change the value, so how many zeroes to pad with is mostly personal taste/convention. So U+05 is U+0000005 is U+0005 is U+5.
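
For example, producing that notation in Python (a minimal sketch of the convention):

# Formatting a codepoint in U+ notation - zero-padding to four hex digits is just convention
cp = ord(' ')
print(f"U+{cp:04X}")    # U+0020
print(hex(cp))          # 0x20 - same number, different notation
print(chr(0x20))        # ' '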


Code units, codepoints, scalar value

In general use (more so when working on the unserialized form of unicode, or thinking we are), we tend to just say 'codepoint', and intuitively mean what are technically scalar values.


In UTF-8 the serialized and unserialized forms are distinct enough that there's no confusion.


In UTF-16, confusion happens because

for the ~63K codepoints in the BMP (minus the surrogate range), code units code for codepoints directly,
and for the million-ish codepoints above the BMP they don't,
but particularly English speakers deal primarily with that first set.


Code units aren't a uniquely UTF-16 concept, but are the most interesting in UTF-16.

Because UTF-16 had to become a variable-width serialization, it introduced code units that are not valid codepoints and only used to indirectly code for other codepoints.

For example, the sequence U+D83E U+DDC0 is a surrogate pair - two UTF-16 code units that code for the single U+1F9C0 codepoint. You will never see U+D83E or U+DDC0 by itself in UTF-16 text. Not in well-formed text, anyway.
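
You can see that happen by serializing it yourself, for example in Python:

s = '\U0001F9C0'                  # the codepoint U+1F9C0
print(len(s))                     # 1 - one codepoint
units = s.encode('utf-16-be')     # serialize (big-endian, no BOM)
print(units.hex())                # 'd83eddc0' - the two 16-bit code units d83e and ddc0, a surrogate pair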


But nobody much uses that terminology precisely, and we often use the notation we associate with codepoints for UTF-16 code units as well - which causes confusion, as it blurs the line between encoded form and 'pure', scalar-value codepoints.

And you will run into this a lot in programming.



More technically

Relevant terms - see https://www.unicode.org/glossary/

  • Code Point[2] - Any value in the Unicode codespace; i.e. the range of integers from 0 to 0x10FFFF.
  • Code Unit[3] - "minimal bit combination that can represent a unit of encoded text for processing or interchange"
e.g. UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, UTF-32 uses 32-bit code units
  • Scalar value[4] - all codepoints except surrogates.
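
To make the code unit idea concrete, counting code units for the same short string in Python (a minimal sketch):

s = 'a\U0001F9C0'                        # one BMP codepoint plus one above the BMP
print(len(s.encode('utf-8')))            # 5 - five 8-bit code units (1 + 4)
print(len(s.encode('utf-16-le')) // 2)   # 3 - three 16-bit code units (1 + a surrogate pair)
print(len(s.encode('utf-32-le')) // 4)   # 2 - two 32-bit code units, one per codepoint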



Storing, altering

So, say you now know that U+2164 (Ⅴ) is roman numeral five - how do you communicate it?


A lot of document formats opt for UTF8, or UTF16.

And are either defined to use that always, or mark what they use within the document.


Some cases have further options. For example, most HTML and most XML allow three options: numeric entities like &#233; or &#xE9;, named entities like &eacute; (except it's a smallish list, and XML doesn't define those), and raw bytes according to the document encoding - that example would be 0xC3 0xA9 in UTF-8.
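
For instance, producing those forms for é in Python (a minimal sketch; the named entity is just looked up, not computed):

ch = '\u00e9'                    # é
print(f'&#{ord(ch)};')           # &#233;  - numeric entity, decimal
print(f'&#x{ord(ch):x};')        # &#xe9;  - numeric entity, hexadecimal
print('&eacute;')                # the named entity HTML defines for it
print(ch.encode('utf-8').hex())  # 'c3a9'  - the raw bytes if the document is UTF-8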


Dealing with unicode in programming is a more complex topic.

In-memory, UCS4 (just the codepoints as plain integers) is simplest for you to deal with, but for practical reasons (storage space) and historical reasons (like Windows APIs initially being UCS2, which made retrofitting to UTF-16 easier) some implementations actually opt for UTF-16 instead.

UTF-16 makes operations that alter unicode strings more complex - even just counting codepoints means you have to decode the string. But it's still doable, and done. And you'd probably be using libraries anyway.


And most operating systems and most programming languages worth mentioning support Unicode at some level, relieving programmers from having to be obsessive-compulsive experts to get it right. Still, you have to know approximately what you're doing.




UCS

UCS (in itself, without a number) refers to the Universal Character Set - abstractly to the character set that Unicode defines.


In programming you more usually see references to UCS2 or UCS4, ways of storing non-encoded codepoints in fixed-size elements (16-bit and 32-bit uints). (and from the perspective of bytes, you are using one of UCS2BE, UCS2LE, UCS4LE, or UCS4BE)

Unicode library implementations often use an array of integers in the machine's native endianness, for speedy operations and simplicity, where each integer is either 2 bytes (UCS2) or 4 bytes (UCS4) - the latter supports all of unicode, the former only the BMP.

Various common operations are much simpler in this format than in encoded data (than in a UTF), including finding the number of characters, moving bidirectionally within a string, comparing two strings, taking a substring, overwriting characters, and such.

Yes, you could back an implementation with e.g. UTF16 instead, which is more memory-efficient, but the logic is a little slower, and much hairier. But it's doable, and done in practice.
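
A rough illustration of the idea in Python, using the array module purely as a stand-in for a UCS4-ish buffer:

import array

s = 'héllo'
buf = array.array('I', (ord(c) for c in s))   # one unsigned int per codepoint
print(list(buf))      # [104, 233, 108, 108, 111]
print(len(buf))       # counting characters is just the element count
print(buf.itemsize)   # typically 4 bytes per element ('I' is platform-dependent)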


UTF

UTF (Unicode Transformation Format) refers to various variable-length encodings of unicode.

The two most common flavours are UTF-8 and UTF-16 (Others exist, like UTF-32 and UTF-7, but these are of limited practical use).


UTF-8 and UTF-16 are both variable-length byte codings, which means only left-to-right movement in these strings is simple, and you technically can't seek to an arbitrary position without decoding everything before it (though for both there are ways to do so if you can assume a well-formed string(verify)).

Another implication is that various operations (see above) are more work, and often a little slower than unencoded form.


Note that UTF-16 is a superset of UCS2, and that the main addition of UTF-16 over UCS2 is surrogate pairs.

So it is backwards compatible -- but creates some footnotes in the process.

Windows before 2000 used UCS2, Windows 2000 and later use UTF-16 -- to support all characters and break (almost) no existing code. That is, interpreting UTF-16 as UCS2 won't fail, it just limits you to ≤U+FFFF.

Older software seeing surrogates would just discard them, or turn them into U+FFFD, for being non-allocated characters. They won't display, but you can't really expect that from software from a time when these character ranges didn't exist. More important is that it doesn't fail.


Limits

UCS2 is limited to U+FFFF

UTF-16 is limited to U+10FFFF (surrogate pairs can only code so many codepoints)

UTF-8, UCS4, and UTF-32 can in theory go up to at least 2^31, but currently everything keeps under the overall cap imposed via UTF-16:

while UTF-8 as an algorithm could code codepoint values up to 2^31, UTF-16's currently defined surrogates can only code 2^20 values beyond the BMP.

Since UTF-16 is at the core of a bunch of Unicode implementations, and we haven't remotely filled the codepoints we have, that cap is unlikely to change any time soon.


Space taken

UTF-8 uses

  • 1 byte up to U+7F (which is mostly ASCII as-is, which is intentional)
  • 2 bytes for U+80 through U+7FF
  • 3 bytes for U+800 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF
  • It was designed to code higher codepoints, with 5-byte and 6-byte sequences, but this is currently outside the cap, and we are unlikely to extend that anytime soon, as we've not used more than ~30% of the space we have.
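
A quick check of those lengths in Python:

for ch in ('A', '\u00e9', '\u20ac', '\U0001F9C0'):   # A, é, €, and U+1F9C0
    print(f'U+{ord(ch):04X}', len(ch.encode('utf-8')), 'bytes')
# U+0041: 1 byte, U+00E9: 2 bytes, U+20AC: 3 bytes, U+1F9C0: 4 bytes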


Note that:

  • Most ASCII is used as-is in Unicode
particularly the printable part, so printable ASCII is itself valid UTF-8 bytes
the control characters (those below U+20) also map as-is, they're just not the printable part
  • ...also meaning English encoded in UTF-8 is readable enough by humans and other Unicode-ignorant systems, and could be edited in most text editors as long as they leave the non-ASCII bytes alone.
  • All codepoints above U+7F are coded with purely non-ASCII byte values, and zero bytes only occur when coding U+0000 itself, meaning that
you can use null-terminated C-style byte strings to store UTF8 strings
C library string functions won't trip over UTF-8, so you can get away with carrying them that way
...just be aware that unicode-ignorant string altering may do bad things
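
A small Python illustration of why byte-oriented handling keeps working:

encoded = 'héllo'.encode('utf-8')
print(encoded)               # b'h\xc3\xa9llo' - the ASCII letters are untouched
print(0 in encoded)          # False - no zero bytes, so NUL-terminated storage still works
print(encoded.find(b'llo'))  # 3 - byte-oriented substring search still behaves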


UTF-16 uses

  • 2 bytes for U+0000 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF (using pairs of surrogates to code above U+FFFF; surrogates as currently defined are also the main reason for that U+10FFFF cap)


UTF-32 stores all characters unencoded, making it superficially equivalent to UCS4 (it isn't quite: UCS is a storage-ambivalent enumeration, UTF is a byte coding with unambiguous endianness)


If you want to minimize stored size/transmitted bandwidth, you can choose between UTF-8 or UTF-16 based on the characters you'll mostly encode.

the tl;dr is that

  • UTF-8 is shortest when you store mostly western alphabetics
because characters under U+800 are coded with at most 2 bytes, and most western-ish alphabetic languages almost exclusively use codepoints under that - and then mostly ASCII, coded in a single byte, so this will typically average less than 2 bytes per character.
  • UTF-16 is sometimes better for CJK and mixed content
UTF-16 uses exactly 2 bytes for everything up to U+FFFF, and 4 bytes for everything above that point,
so in theory it'll average somewhere between 2 and 4 bytes per character;
in practice, the most common CJK characters are in the U+4E00..U+9FFF range, so you may find it's closer to 2 -- while in UTF-8 most of those take 3 or 4 bytes.
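
A rough comparison in Python (the sample strings are just examples):

western = 'Ça coûte déjà très cher'   # mostly ASCII, some accented Latin
cjk = '日本語のテキストの例です'           # BMP kana/kanji
for text in (western, cjk):
    print(len(text), 'chars:',
          len(text.encode('utf-8')), 'bytes as UTF-8,',
          len(text.encode('utf-16-le')), 'bytes as UTF-16')
# the western sample is smaller in UTF-8, the CJK sample is smaller in UTF-16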

On surrogates

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Surrogates appear in UTF-16 serialized form, and are used to allow UTF-16 to represent the U+10000 to U+10FFFF range of codepoints.

Surrogates are values in the range 0xD800–0xDFFF (2048 values)


These values should appear only in UTF-16 serialized form.

They should not appear in UCS (i.e. those values are not valid codepoints) -- but the fact that some unicode implementations are UTF-16 rather than UCS4 can make that distinction somewhat vague and confusing in practice, so in the messy real world, this mixup does sometimes happen.


They should also not appear in other coded forms. For example, to UTF-8 they are just numbers it could code perfectly well, but because they shouldn't appear in UCS, UTF-8 coders should give errors rather than pass them through - preferable behaviour, because double-encoding UTF codings is probably never a good idea.


Surrogate values are split into two ranges of 1024:

High surrogates are in the range of 0xD800 .. 0xDBFF
Low surrogates are in the range of 0xDC00 .. 0xDFFF


Surrogates are only valid in pairs, where the first must always come from the high range and the second always from the low range.

Each contributes 10 bits to an overall value in a relatively simple way, meaning a surrogate pair can only code a 20-bit value.

UTF-16 surrogates are used to represent codepoints U+10000 through U+10FFFF (that is, 0xFFFF+1 to 0xFFFF+0x100000). This seems to be most of the reason the unicode cap is at U+10FFFF (1114111 in decimal).
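
The arithmetic is simple enough to show directly (a minimal sketch of the transformation, not how you'd normally do it - your language's codecs handle this for you):

def to_surrogates(cp):
    # valid for U+10000..U+10FFFF: subtract 0x10000, split the remaining 20 bits in half
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(hi, lo):
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

print([hex(u) for u in to_surrogates(0x1F9C0)])   # ['0xd83e', '0xddc0']
print(hex(from_surrogates(0xD83E, 0xDDC0)))       # 0x1f9c0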


The "pair is high then low" lets us detect data that doesn't seem to be UTF-16 at all, but more importantly, invalid use of surrogates.

And yes, that does happen. For example, Windows's APIs don't check for well-formed UTF-16, so will happily pass through ill formed UTF-16 (e.g. from filesystem filenames) that somehow made it past some API that didn't check earlier.

(UTF-16 decoding should throw away lone surrogates that you see in encoded UTF-16, because that's ill-formed unicode. This is fairly simple code due to the way they're defined: every value in 0xDC00 .. 0xDFFF without a value in 0xD800 .. 0xDBFF before it should go away, every value in 0xD800 .. 0xDBFF without a value in 0xDC00 .. 0xDFFF after it should go away)


Notes:

  • The list of unicode blocks has high surrogates split into regular and "private use surrogates"
this seems to just reflect that the last 128 thousand of the ~million codepoints that UTF-16 surrogates can represent fall in planes 15 and 16, Private Use Area A and B


  • One annoying side effect is that any program which absolutely must be robust against invalid UTF-16 will probably want its own layer of checking on top of whatever the language provides.
In theory, you can ignore this because someone else has a bad implementation and them fixing it is best for everyone.
In practice, erroring out on these cases may not be acceptable.


  • Remember that "UTF-8 should not contain surrogates, lone or otherwise"? Yeah, some implementations do deviate
For example
Java uses a slight (but incompatible) variant of UTF-8 called "Modified UTF-8"[5][6], used for JNI and for object serialization(verify).

similarly, WTF-8 takes the view that it can be more practical to work around other people's poor API implementations, rather than go "technically they broke it, not my problem, lalalala"
it allows representation of invalid UTF-16 (from e.g. windows) so that it can later ask it for the same invalid string


More technical notes

Unicode General Category

Unicode states that each codepoint has a general category (a separate concept from bidirectional category, which is often less interesting)


The general class is capitalized, the detail is in an added lowercase letter.


Letters (L):

  • Lu: Uppercase
  • Ll: Lowercase
  • Lt: Titlecase
  • Lm: Modifier
  • Lo: Other

Numbers (N):

  • Nd: Decimal digit
  • Nl: Letter (e.g. roman numerals like Ⅷ)
  • No: Other (funky scripts, but also subscript numbers, bracketed, etc)

Symbols (S):

  • Sm: Math
  • Sc: Currency
  • Sk: Modifier
  • So: Other

Marks (M):

  • Mn: Nonspacing mark
  • Mc: Spacing Combining mark
  • Me: Enclosing mark

Punctuation (P):

  • Pc: Connector
  • Pd: Dash
  • Ps: Open
  • Pe: Close
  • Pi: Initial quote (may behave like Ps or Pe depending on usage)
  • Pf: Final quote (may behave like Ps or Pe depending on usage)
  • Po: Other

Separators (Z):

  • Zs: Space separator,
  • Zl: Line separator,
  • Zp: Paragraph separator,

Other (C):

  • Cc: Control
  • Cf: Format
  • Cs: Surrogate
  • Co: Private Use
  • Cn: Not Assigned (no characters in the unicode.org list explicitly have this category, but implementations are expected to return it for every codepoint not in that list)
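
In Python, unicodedata reports these categories directly:

import unicodedata

for ch in ('a', 'A', '5', '\u2167', '\u20ac', '+', ' ', '\u0301'):
    print(f'U+{ord(ch):04X}', unicodedata.category(ch))
# Ll, Lu, Nd, Nl (Ⅷ), Sc (€), Sm, Zs, Mn (combining acute)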


Pay attention to unicode version, since there are continuous additions and corrections. For example, older versions may tell you all non-BMP characters are Cn. I believe Pi and Pf were added later, so beware of testing for too few lowercase variations.


On modifiers and marks: TODO

Normal forms

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Or, what's with this equivalence and decomposition stuff?


Unicode's normal forms are mainly meant to compare whether codepoints (and strings containing them) mean the same thing semantically, and/or seem the same thing visually, even when they are not the same codepoints.

To that end, it

  • defines two types of equivalence: canonical equivalence, and compatibility equivalence.
  • lets you decompose characters/strings to what it considers its component parts

As almost a side effect of the last, this is also useful to normalize strings in some specific ways, sometimes to remove diacritics, and other things -- but because this is not its actual aim, you should assume Unicode does neither of those things well.


Equivalence

Canonical equivalence is mainly used to test things that may show up in more than one way but are interchangeable.

Interchangeable in the sense of semantic equivalence: does it mean the same thing.

This usually also implies both will be rendered identically.


Consider é. It can show up in both of the following ways:

  • é as a single codepoint, U+00E9 (LATIN SMALL LETTER E WITH ACUTE)

OR

  • e followed by a combining acute accent (U+0065 U+0301) - which looks the same but is actually two codepoints

As far as canonical equivalence is concerned, these two are interchangeable.


Compatibility equivalence is for things you might want to reduce to the same thing, but which are not interchangeable.

Consider the roman numeral Ⅳ as a single character, U+2163 (Ⅳ).

In some contexts it will be useful to consider it the same as the basic two-character string "IV" - yet you would not generally want to consider the letters IV to be the roman numeral (e.g. SHIV is an all-capsed word, and does not mean SH5).



Note that since

canonically equivalent is "basically the same thing",
compatibility equivalence is more of a fuzzy 'does this look roughly the same',

...anything that is canonically equivalent is implicitly also compatibility equivalent.

And not the other way around: canonical equivalence implies compatibility equivalence, but not vice versa.


Decomposition and composition (and tests using them)

Canonical decomposition and canonical composition will only separate and (re)combine things that are semantically equivalent.

For example, in python:

import unicodedata

# canonical decomposition: é (U+00E9) becomes e plus combining acute (U+0301)
unicodedata.normalize("NFD", "\u00e9")  == 'e\u0301'

# canonical composition: e plus combining acute becomes é (U+00E9) again
unicodedata.normalize("NFC", 'e\u0301') == '\xe9'


What they actually do is

  • NFD, a.k.a. Normalization Form D (and Normalization Form Canonical Decomposition)
will decompose only to canonically equivalent forms. Examples:
Roman numeral character Ⅳ stays Ⅳ
e-acute becomes a separate e and combining acute
  • NFC, a.k.a. Normalization Form C (and Normalization Form Canonical Composition)
decomposes canonically, then composes canonically
Roman numeral character Ⅳ stays Ⅳ
a separate e and combining acute become e-acute


When you want to

  • sort more sensibly
  • index text more sensibly
  • comparison to be semantic and/or fuzzy without doing this transform each time

...then you choose to e.g. run either composition or decomposition over all your text at input time.
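
For example, comparing two spellings of the same string after normalizing both (Python's unicodedata again):

import unicodedata

a = '\u00e9cole'     # é as one codepoint
b = 'e\u0301cole'    # e plus combining acute
print(a == b)        # False - different codepoints
nfc = lambda s: unicodedata.normalize('NFC', s)
print(nfc(a) == nfc(b))   # True - canonically equivalent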


There are further operations where normal forms are involved, like

  • removing certain accents (not really a designed feature, more of a 'you can often get decently far')
  • halfwidth and fullwidth forms include a copy of the basic Latin alphabet, mostly for lossless conversions between this and older encodings containing both halfwidth and fullwidth characters (verify)


The conversions aren't quite like the canonical variants. Consider

  • NFKD (Normalization Form Compatibility Decomposition) decomposes to compatibility-equivalent forms. Examples:
Roman numeral character Ⅳ becomes the two-letter "IV" (unicodedata.normalize("NFKD", "\u2163") == 'IV')
e-acute becomes a separate e and combining acute
  • NFKC (Normalization Form Compatibility Composition) decomposes to compatibility-equivalent forms, then recomposes canonically. Examples:
Roman numeral character Ⅳ becomes "IV", then stays "IV".
a separate e and combining acute become e-acute



Note that there is no compatibility composition - it would make way too many semantically nonsensical combinations. For example, it doesn't make sense for the letters IV to ever become Ⅳ again.


Planes

Currently, the range open for plane and character definitions is capped at U+10FFFF, which the UCS definition divides into seventeen planes (numbered 0 through 16) of 65536 (0x10000) codepoints each - only a small fraction of which is currently assigned.

  • BMP (Plane 0, the Basic Multilingual Plane) stores scripts for most any live language. (most of U+0000 to U+FFFF). UCS2 was created to code just this plane.
  • SMP (Plane 1, the Supplementary Multilingual Plane) mostly used for historic scripts (ranges inside U+10000 to U+1DFFF)
  • SIP (Plane 2, the Supplementary Ideographic Plane) stores more Han-unified characters (ranges inside U+20000 to U+2FFFF)
  • SSP (Plane 14, the Supplementary Special-purpose Plane) contains a few nongraphical things, like language tags (a range in U+E0000 to U+EFFFF)
  • Planes 3 to 13 are currently undefined.
Plane 3 has tentatively been called the Tertiary Ideographic Plane, but (as of this writing) has only tentatively allocated characters
  • Plane 15 (U+F0000–U+FFFFF) and plane 16 (U+100000–U+10FFFF) are Private Use Areas A and B, which you can use for your own font design - non-transportable, at least in the semantic sense.
(There is also a private use range in BMP, U+E000 to U+F8FF)

Byte Order Mark (BOM)

In storage you often deal with bytes.


UTF16 is 16-bit ints stored into bytes, so endianness becomes a thing.

It's useful to standardize any file serialization to a specific endianness, or to store its overall endianness, but you don't have to: you can optionally start each UTF-16 string with a Byte Order Mark (BOM).

The BOM is character U+FEFF, which once it's bytes works out as either:

  • FE FF: means big-endian UTF-16 (UTF-16BE)
  • FF FE: means little-endian UTF-16 (UTF-16LE)


Notes:

  • U+FEFF was previously also used as a zero-width non-breaking space, so could also appear in the middle of a string.

To separate these functions, BOM in that use is deprecated, and zero-width non-breaking space is now U+2060.

  • U+FFFE is defined never to be a character, so that the above test always makes sense
  • Because BOMs are not required to be handled by all Unicode parsing, it is conventional that:
    • The BOM character is typically removed by UTF-16 readers
    • If a file/protocol is defined to be a specific endianness, it will generally not contain the BOM (and may be defined to never use it)



UTF-32 also has BOMs (though they are rarely used)(verify):

  • 00 00 FE FF: big-endian UTF-32
  • FF FE 00 00: little-endian UTF-32

(Note that if you don't know whether it's UTF-32 or UTF-16 (which is bad practice, really), the little-endian UTF-32 BOM is indistinguishable(verify) from a little-endian UTF-16 BOM followed by a NUL character, U+0000 - unlikely, but valid.)



UTF-8 is a byte coding, so there is only one order it can have in byte storage.

You do occasionally find the BOM in UTF-8 (in encoded form, as EF BB BF). This has no bearing on byte order since UTF-8 is a byte-level coding already. Its use in UTF-8 is mostly as a signature, also to make a clearer case for encoding detectors, though it is usually pointless, and accordingly rare in practice.
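
Python's codecs show the variants (a small sketch; the endianness the plain 'utf-16' codec picks follows the platform, typically little-endian):

print('hi'.encode('utf-16'))      # b'\xff\xfeh\x00i\x00' - BOM added, then native-endian units
print('hi'.encode('utf-16-le'))   # b'h\x00i\x00' - no BOM; you already said which endianness
print('hi'.encode('utf-8-sig'))   # b'\xef\xbb\xbfhi' - the optional UTF-8 signature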


Unicode Text Segmentation

Grapheme clusters

Other Unicode encodings

Unicode was its own thing when introduced, and the web now largely uses UTF-8.


There are other encodings, mostly created after Unicode was introduced, that can also encode Unicode characters, not least of which is GB 18030, a Chinese standard which can encode all Unicode planes (verify).


If you consider a fuller list (and excluding UTF-8, UTF-16, arguably UTF-32, and GB18030), then most seem to serve a specific purpose, including:

Compressed formats:

  • SCSU [7] (Standard Compression Scheme for Unicode) is a coding that compresses Unicode, particularly if it only uses one or a few blocks.
  • BOCU-1 [8] (Binary Ordered Compression for Unicode) is derived from SCSU

These have limited applicability. They are not too useful for many web uses, since for content larger than a few kilobytes, generic compression algorithms compress better.

SCSU cannot be used in email messages because it allows characters that may not appear inside MIME. BOCU-1 could, but won't until clients widely decide to support it and they have little reason to these days.


Codings for legacy compatibility:

  • UTF-7 [9] does not use the highest bit and is safe for (ancient) 7-bit communication lines
  • UTF-EBCDIC [10]


(TODO: look at:)

More practical notes

Note: Some of this is very much my own summarization, and may not apply to the contexts, systems, pages, countries and such most relevant to you.


On private use

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


As far as unicode is concerned, these codepoints exist, but have no meaning.


In terms of semantics and accessibility, it's bad.

It may look like, say, Klingon to you, but not to a computer, unless both sides explicitly agree exactly which non-standard thing they are pasting on top.

If you don't pair it with a font, it won't even look like Klingon to you.

So blind people will not see this, and text-to-speech cannot pick it up.


Private use mostly acknowledges that people will want to do this at all. Without providing for it, people could only get their own glyphs by replacing existing characters, and showing something that isn't the character it's defined to be is probably even more confusing (remember the mess that codepages and printer fonts made, and people being confused by Wingdings replacing standard ASCII values with its own thing) than using codepoints that have no meaning.


And without saying what you're adding (and there is no standard way to say it), others won't even know what it is.



You should probably avoid it in general unless you have specific reason to.

You should probably avoid it unless you can also deliver the corresponding font.


Examples of use:

  • There are projects like the CSUR (ConScript Unicode Registry[13]), an organized attempt to assign conlangs to private use ranges in a way that won't conflict - also meaning you can often mix them with only minimal font wizardry




On Asian characters

Beyond plain visible characters

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

There are a few dozen codepoints officially labeled <not a character>, which don't code anything. This e.g. includes

  • the last two codepoints in each plane
e.g. U+FFFE and U+FFFF on the BMP (U+FFFE in particular relates to BOM detection)
  • U+FDD0..U+FDEF
apparently "to make additional codes available to programmers to use for internal processing purposes" [14]. You can think of this as a different type of private use
unrelated to the Arabic Presentation Forms-A range it is in, it just had space unlikely to be used


There are a lot of other things that are allocated codepoints, but are not / do not contribute to visible glyphs.

This is not a Unicode classification, just my own overview.

  • Control characters (definitions from before Unicode)
    • 'C0 range', referring to the ASCII control characters U+00 to U+1F
    • 'C1 range', U+80-U+9F.
    • U+7F (DEL) is usually also considered a control character.
  • Surrogates (U+D800 to U+DFFF in two blocks)
in that seen in isolation, they code no character (and make the string ill-formed)
Their real purpose is to be able to code higher-than-BMP codepoints in UTF-16 (including various UTF16-based Unicode implementations)


  • Variations
    • Variation Selectors, U+FE00 to U+FE0F - used to select standardized variants of some mathematical symbols, emoji, CJK ideographs, and 'Phags-pa [15]
    • Variation Selectors Supplement, U+E0100 to U+E01EF [16]
See the Ideographic Variation Database[17][18]
short summary: emoji use VS16 (U+FE0F) (and sometimes VS15, U+FE0E, to force non-emoji representation); non-CJK sometimes uses VS1 (U+FE00); CJK uses VS1, VS2, VS3; VS4 through VS14 are not in use (as of Unicode 11)
  • Tags (U+E0000 to U+E007F)
Originally intended to tag languages in text (see also RFC 2482); this was deprecated from Unicode 5 on
Unicode 8 and 9 cleared the deprecated status (for all but U+E0001) to allow some other use in the future
https://en.wikipedia.org/wiki/Tags_(Unicode_block)


Combining characters - visible, but need another character to go on


See also:


Mixing and messing encodings - mojibake

If you take something encoded into bytes (UTF-8, codepage, other) and assume/guess wrong about what it is or what it should be decoded as (or maybe encode it twice), then you get a garbled mess.


One of the wider names for such mistakes is mojibake.


And there's a bunch of possible combinations, so it varies whether you can detect the specific problem, and whether you can then fix it after the fact.


There are some more specific forms that are easier to recognize.

Bad encoding-decoding combinations: Â, Ã, etc.

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

When you take a text coding that was already encoded into bytes (charsets or UTF-8 or other), and encode that using UTF-8 (or something else), you get mangled data.


There are many variations on that theme, but the most common are probably

UTF-8 encoding Latin1/CP-1252 byte data, and
UTF-8 encoding UTF-8 byte data.


Nineties to noughties, the internet started storing and sending UTF-8, but web browsers would often still think Latin1/CP-1252 was a thing we wanted, and webservers might even tell us so explicitly (usually as a default/fallback, when you can't know).

This largely went away because browsers started defaulting to UTF-8 instead and, to a lesser degree, people started being better about specifying the encoding in use. Or the standard said so - e.g. HTML5 made a point of being UTF-8 by default.



Reasons, and recognizing these

  • bytes misread as codepoints will always sit within U+00 and U+FF
the non-ASCII bytes of UTF-8 will be within U+80 and U+FF
while they can be anything within that range, in western practice the first byte tends to stick to fewer values than that - consider...


  • U+0080 through U+07FF covers much of the additional characters seen in Europe and other Latin-related alphabets
those codepoints all become two-byte UTF8, and the first byte of which sits within 0xC2 through 0xDF.
If you misinterpret those as codepoints as-is, well, U+C2 through U+DF is Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
more specifically
the bulk of western Europe's accented letters sits within U+A0..U+FF (in Latin Supplement), which in terms of UTF8 first bytes is 0xC2 and 0xC3 (if you misinterpret as codepoints: Â Ã)
some less usual (and unusual) characters sit in Latin Extended-A and Latin Extended-B, U+100 through U+24F, UTF-8 first bytes 0xC4 through 0xC9; if you misinterpret: Ä Å Æ Ç È É
combining diacritical marks' first UTF-8 bytes are 0xCC and 0xCD (Ì Í)
things like Greek, Cyrillic, Hebrew, and Arabic have first bytes roughly 0xCD through 0xDD (Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý)


  • there are some specific outliers
e.g. word processors are known to replace the plain apostrophes/quotes around words with ‘ (U+2018, LEFT SINGLE QUOTATION MARK) and ’ (U+2019, RIGHT SINGLE QUOTATION MARK),
which as UTF-8 bytes is e2 80 98 and e2 80 99
if these bytes are shown via CP-1252 (as they used to be), you will see â€˜ and â€™ (the latter being the famously common mangled apostrophe)
if these bytes are shown as codepoints, (U+E2 U+80 U+99), this is less visible, because while U+E2 is â, the last two are control codes, and won't have a printed character.


In a good number of cases you can un-mangle this if you manage to determine what happened. Sometimes you can make a very good guess yourself (e.g. seeing U+E2 U+80 U+99 a bunch of times in otherwise basic western text); there is also tooling that will estimate what happened from the data you give it.
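
A small Python illustration of the most common case, and of undoing it once you know what happened:

mangled = '\u00e9'.encode('utf-8').decode('cp1252')   # é stored as UTF-8, then read as CP-1252
print(mangled)                                        # 'Ã©'
print(mangled.encode('cp1252').decode('utf-8'))       # 'é' - reversing the exact same steps
# caveat: this only works when every byte survived; CP-1252 leaves a few bytes undefined,
# so real repair tooling has to be more careful than this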


http://www.i18nqa.com/debug/utf8-debug.html

Emoji

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Types of emoticons / emoji (wider view than unicode)

Short history:

  • Emoticons originally mostly referred to combinations of ASCII, mostly simple smilies like :)


  • ...plus other characters we already had (unicode or not), most of them local ones, e.g.
Japanese kaomoji which started as ASCII like (*_*) (also seen in the west) but now have variants like (´。• ω •。`)
Korean style (like ㅇㅅㅇ),
down to patterns just seen in one place more than another, like ó_ò in Brazil, ಠ_ಠ initially on forums, etc.


  • some sites, mostly message boards, would replace specific text (like :)) with images.
Often mostly the simpler ASCII smilies, but some more extensive
  • The same boards are also known for adding further shortcodes, specific strings like :bacon: used as unique text markers to always replace with images
more flexible, sometimes board-specific - and for that same reason is less widespread


  • Later, a decent number of well used emoji were added to Unicode (most after 2010)
which would render as-is assuming you have a font, however...
various sites/devices do image replacement for these Unicode characters, to get color and their specific style.
to the point that this is part of their text editors


Wider support of emoji in Unicode made things more interesting (for a few different reasons).

Image replacement of Unicode was initially more often HTML-level replacement of a character with an img tag. These days the image replacement is often more integrated, where an app and/or phone input fields and/or browser(verify) shows the replacement as you type.

The same browsers/apps will also often provide decent emoji keyboards/palettes to make it much easier to enter them in the first place.

This kind of support is a fairly recent development (see e.g. http://caniemoji.com/), and getting pretty decent.

You can consider this more native support of emoji -- but it's also somewhat misleading. What gets sent is not the image you see but the unicode underlying it. And because there are now at least a dozen sets of images, it may look different in another app.

More on image replacement

Apps can choose to replace creative-use-of-characters emoticons, shortcodes, and/or unicode - with unicode and/or images.

We've somewhat shifted away from shortcodes, and from using custom image sets as a replacement (both work poorly between systems).


We've shifted to unicode emoji, probably in part because input palettes being largely unicode made those more reachable.


Apps may still use their own private set of image replacements, but this is mostly brand-specific aesthetics on top of the same shared underlying characters.


Note that

  • the number of characters any of these replace varies.
  • apps may implicitly inherit this set from the platform they are on
particularly phone apps may get this
  • apps may explicitly adopt one of them
  • and yes, there have been some cases where the image aesthetics actually changed meaning, which led to miscommunication between people on different platforms
https://grouplens.org/blog/investigating-the-potential-for-miscommunication-using-emoji/


Emoji sets include:

Google [19] [20]
Microsoft [21]
Apple [22]
Samsung [23]
HTC [24]
LG [25]
WhatsApp [26]
Twitter [27], also used by e.g. Discord
Snapchat [28]
Facebook [29]
Mozilla [30]
EmojiOne [31] - licensable set used e.g. by Discourse, Slack
GitHub [32]
GMail - had a smaller set (used in subject lines, often square-looking)


Emoji according to unicode

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

A bunch of emoticon-like characters have been around for a long time - if you include varied symbols like ☹ ☺ ☻, since Unicode 1.

Only later (Unicode 6, ~2010, and onward) did we start adding emoji in bulk, in more dedicated ranges. By amount, these make up the bulk of what we now consider emoji.

Data standardizing emoji came later, with Emoji 1 (~2015), which started its own versioning (1, 2, 3, 4, 5, 11, 12, 13 - the jump comes from a choice to synchronize Emoji versions with Unicode versions).


Note also that Emoji makes a distinction between emoji representation and non-emoji (text) representation. Any one character may have either or both, and more interestingly, some characters got an emoji representation well after their first introduction (e.g. ❤ got ❤️).
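
For example, the two presentations differ only by a trailing variation selector:

text_style  = '\u2764'          # ❤  HEAVY BLACK HEART, text presentation by default
emoji_style = '\u2764\uFE0F'    # ❤️  same character plus VS16, requesting emoji presentation
print(len(text_style), len(emoji_style))   # 1 2 - the selector is its own codepoint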


Emoji 1 was largely a roundup of all the emoji-like codepoints from the first seven(verify) versions of Unicode,

approx 1200 of them (counting flags)


Emoji 2 (which came soon after 1) introduced sequences, and defined specific uses of (already-existing) skin tone modifiers, plus some others.

Versions since then have mostly just expanded on characters, sequences, and modifiers.

Emoji 3

Emoji 4

Emoji 5

Emoji 11

Emoji 12 and Emoji 12.1

Emoji 13 and Emoji 13.1

Emoji 14




Notes:

  • Note that brand new emoji often are not immediately known in browsers/devices, and font support generally lags much more.
  • If you're going to parse unicode's emoji data, you'll probably want to read [33].




See also:



Flags

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Flags do not have singular codepoints - this would be slightly awkward to keep up to date due to changing geopolitics.


When you restrict yourself to countries, things stay relatively simple. Sort of.

Unicode 6 added a range of letters used only as regional indicators (U+1F1E6 through U+1F1FF, within Enclosed Alphanumeric Supplement).

It comes down to "use these in pairs to refer to ISO 3166-1 alpha-2 code"[34].

(spoilers: except exceptions)


So, for example, 🇧🇪 is actually the sequence of

U+1F1E7 REGIONAL INDICATOR SYMBOL LETTER B
U+1F1EA REGIONAL INDICATOR SYMBOL LETTER E

Doing it this way means

  • there's just one authoritative place to keep up to date.
  • changes in countries are up to implementations following changes to ISO 3166, not definitions within Unicode.
  • ...flags can actually show up differently in a lot of places.
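
Building such a sequence is just offset arithmetic on the letters (a minimal sketch; it assumes a two-letter ISO 3166-1 code, and whether you see a flag or two boxed letters depends on your fonts/platform):

def flag(country_code):
    # map A-Z onto the regional indicator symbols U+1F1E6..U+1F1FF
    return ''.join(chr(0x1F1E6 + ord(c) - ord('A')) for c in country_code.upper())

print(flag('BE'))   # 🇧🇪 - really just two regional indicator codepoints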



One practical question is "where do you stop with regions?"

The above is an answer, but not a fully satisfying one.

Consider the UK.


This also runs right into politics.

A Microsoft or an Apple can, say, decide to include both the Israeli flag and the Palestinian flag. Might be a bit contentious depending on where you are at the moment, but you can defer to some type of neutrality this way.

But what to think of Apple removing the ability to render the Taiwanese flag (TW, 🇹🇼) only if the locale is mainland China? The reason clearly seems to be that the flag can symbolize Taiwanese independence, which the Chinese government is, ahem, not a fan of - but Apple still wants to sell iPhones to a billion Chinese people, soooo.... You now probably have opinions on the ethical stance that Apple is not taking there.

Windows seems to have specifically dropped flag support (browsers have their own implementations because Windows does not), which might well be related to such politics.


Unicode just side-stepped such issues in the first place by basically saying "we give you a system, you decide how to use it".


Unsorted

Right-to-left

https://www.w3.org/International/articles/inline-bidi-markup/uba-basics

http://unicode.org/reports/tr9/