Text coding


An oversimplified history of text coding

(rewriting)



Codepage notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


General and/or western

  • ASCII
  • Also sometimes called US-ASCII, to distinguish it from ASCII variants
    • Equivalent to ANSI X3.4-1968, ISO/IEC 646 (its International Reference Version), the International Reference Alphabet (IRA, defined in ITU-T T.50), IA5 (International Alphabet No.5, the older name for the IRA), and ECMA-6
    • Single-byte. Assigns 0x00 through 0x7F, leaves 0x80-0xFF undefined


  • Latin1, ISO 8859-1
    • one of about a dozen defined in ISO 8859
    • A codepage; extends ASCII to cover about two dozen European languages, and almost covers perhaps a dozen more


  • Windows-1252, also codepage 1252, cp1252, sometimes WinLatin1
    • Default codepage for Windows in Western Europe(verify)
    • a superset of Latin1 (which makes WinLatin1 a potentially confusing name)
    • Defines the 0x80-0x9F range (left undefined in Latin1, though often used there for C1 control characters), using it for printable characters instead, including €, the single guillemets ‹ and ›, the low quotes ‚ and „, Œ, œ, Ÿ, ž, Ž, š, Š, ƒ, ™, †, ‡, ˆ, ‰, and some others.
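
A quick way to see that difference (a minimal Python illustration; the byte values are just arbitrary examples):

b = bytes([0x80, 0x9C])
print(b.decode("cp1252"))    # '€œ' - printable characters in Windows-1252
print(b.decode("latin-1"))   # the same bytes read as Latin1 give (invisible) C1 control characters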




Mostly localized (to a country, or one of various used in a country)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Numbered codepages are often shorthanded with cp, e.g. cp437.

Defined by/for OSes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
  • Mac OS Roman
    • the most common Macintosh coding up to (and excluding) OS X, which adopted UTF-8
    • Known to IANA under the MACINTOSH name, which can be a little misleading as there are several different Apple codings from about the same time.
  • Windows-1252
    • also known as WinLatin1 (which can be confusing, because it conflicts with Latin1, if only on one point)
    • a superset of the basic ISO/IEC 8859-1 (a.k.a. Latin1) set, though strictly conflicting, in that it puts printable characters where 8859-1 has the C1 control range.
  • Windows-1251


Semi-sorted

  • The 7-bit character set defined in GSM 03.38 (...section 6.2.1) [1] (sometimes abbreviated 'GSM')


See also

Unsorted:

Unicode

tl;dr: Unicode:

  • Unicode itself is a character set: an enumeration of characters, in the form of codepoints
  • defines a number of different byte codings
  • covers the scripts of all known living languages and many dead ones
  • Unicode intends to code characters, not glyphs.


A central authoritative list of all the characters there are, with stable numbering for them (codepoints), is wildly useful. It's also the most central thing that Unicode is.

Unicode is one of the broadest character maps, and a fairly thoughtful one at that. Unicode has in fact been used to define some new codepages, and has been used to clarify older codepages, particularly their transforms to and from Unicode.


As a footnote, Unicode was originally considered 16-bit, then considered open-ended, but is currently capped, meaning that anything above U+10FFFF is invalid.

...which is plenty. This allows 1.1 million codepoints, but we're using only a small portion of that:

  • approx 110K are usable characters
  • about 70K of those 110K are CJK ideographs
  • approx 50K are in the BMP (Basic Multilingual Plane), 60K in higher planes (verify)
  • ~130K are private use (no character assigned; can be used for local convention, not advised to use to communicate anything)
  • approx 900K are valid but unassigned, so this is likely to last a long while.



There is some criticism of some choices in its character set/enumeration, and there is even a point to some of that criticism.

Asian characters and Han unification have had hiccups, and there are a number of smaller details you can also argue over. (Still, it's a lot more convenient to use Unicode than to wait for something better or create something only you understand).



The terminology can matter a lot (to the point I'm sure some sentences here are incorrect) because once you get into all the world's languages' scripts, you have to deal with a lot of real-world details, from adding accents to other characters, to ligatures, to emoji, to right-to-left scripts, to scripts where on-screen glyphs change depending on adjacent characters.

For example:

  • Characters have meanings (e.g. latin letter a)
  • A glyph has a style.
Many characters can appear as fairly distinct-looking glyphs - consider the few ways you can write an a (e.g. cursive, block). Exactly the same function, but different looks.
  • also, some scripts will change the glyph used for a character in specific sequences of codepoints
  • consider languages that never separate specific character pairs (e.g. ch in Czech)
  • consider text segmentation, e.g. for cursor movement - the unit you want is sometimes multiple codepoints, so a 'grapheme cluster' becomes a more useful concept; see e.g. UAX #29.

To most of us, the takeaway from that is that codepoints tend to be semantic things - even if, most of the time and for most of the world, they work out as one glyph put after the other.



U+ codepoint denotation

A particular Unicode codepoint is basically an integer, typically denoted using the U+ format, which uses hexadecimal and is frequently seen left-padded with zeroes.


For example, character 32 (a space, taken from ASCII) would be U+20, or U+0020.


Leading zeroes are ignored when parsed, so how many zeroes to pad with is mostly personal taste/convention: U+05 is U+0000005 is U+0005 is U+5.
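
In code terms this is just integer formatting; a quick Python illustration:

cp = ord("é")                # 0xE9
print(f"U+{cp:04X}")         # U+00E9 - zero-padded to four digits by convention
print(chr(0x20))             # ' ' - and back from codepoint to character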


What unicode isn't

Storing, altering

So, say, you now know that U+2164 (Ⅴ) is roman numeral V, so how do you communicate it?


A lot of document formats opt for UTF-8 or UTF-16, and are either defined to always use that, or mark what they use within the document.


Some cases have further options. For example, most HTML and most XML allow three: numeric character references like &#233; or &#xE9;, named entities like &eacute; (though it's a smallish list, and XML itself doesn't define them), and raw bytes according to the document encoding - that last example would be 0xC3 0xA9 in UTF-8.
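
For the same é, those options look like this (a small Python illustration; the entities themselves are standard HTML):

s = "é"
print(f"&#{ord(s)};")        # &#233;  - numeric reference, decimal
print(f"&#x{ord(s):x};")     # &#xe9;  - numeric reference, hexadecimal
print("&eacute;")            # named entity (HTML; plain XML doesn't define it)
print(s.encode("utf-8"))     # b'\xc3\xa9' - raw bytes if the document encoding is UTF-8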


Dealing with unicode in programming is a more complex topic.

In-memory, UCS4 (just the codepoints as plain integers) is simplest to deal with, but there are practical reasons (storage space) and historical reasons (like Windows APIs initially being UCS2, so retrofitting to UTF-16 was easier) why some implementations opt for UTF-16 instead.

UTF-16 makes operations that alter unicode strings more complex - even just counting codepoints means decoding the string. But it's still doable, and done. And you'd probably be using libraries anyway.
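
For example (a small Python sketch; the G clef is just a handy above-BMP character):

s = "a\U0001D11E"                        # 'a' plus U+1D11E MUSICAL SYMBOL G CLEF
print(len(s))                            # 2 codepoints
print(len(s.encode("utf-16-le")) // 2)   # 3 UTF-16 code units, because of the surrogate pair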


And most operating systems and most programming languages worth mentioning support Unicode at some level, relieving programmers from having to be obsessive-compulsive experts to get it right. Still, you have to know approximately what you're doing.




UCS

UCS (in itself, without a number) refers to the Universal Character Set - abstractly to the character set that Unicode defines.


In programming you more usually see references to UCS2 or UCS4: ways of storing non-encoded codepoints in fixed-size elements (16-bit and 32-bit unsigned integers). (From the perspective of bytes, you are then using one of UCS2BE, UCS2LE, UCS4BE, or UCS4LE.)

Unicode library implementations often use an array of integers in the machine's native endianness, for speed and simplicity, where each integer is either 2 bytes (UCS2) or 4 bytes (UCS4); the latter supports all of Unicode, the former only the BMP.

Various common operations are much simpler in this format than in encoded data (i.e. than in UTF), including finding the number of characters, moving bidirectionally through a string, comparing two strings, taking a substring, overwriting characters, and such.

Yes, you could back an implementation with e.g. UTF16 instead, which is more memory-efficient, but the logic is a little slower, and much hairier. But it's doable, and done in practice.


UTF

UTF (Unicode Transformation Format) refers to various variable-length encodings of Unicode.

The two most common flavours are UTF-8 and UTF-16 (Others exist, like UTF-32 and UTF-7, but these are of limited practical use).


Because these are variable-length byte codings, only left-to-right movement through such strings is simple. Another implication is that various operations (see above) are more work, and often slower.


Note that UTF-16 is a strict superset of UCS2: UTF-16 only adds the surrogate pairs, so it is backwards compatible. In fact, Windows before 2000 used UCS2, while 2000 and later use UTF-16, supporting all characters while breaking nothing. Decoding UCS2 data as UTF-16 won't fail, while a strict UCS2 reader would discard UTF-16 surrogate pairs as nonsense.

(In the case of Windows, this means Unicode is stored in less memory than UCS4 would take while supporting the same character set -- but also that various string operations are more complex and somewhat slower than in a UCS4 implementation.)

This is also one reason UCS2 is often considered to be outdated - there is no good reason not to use UTF-16 instead.


Note that while UTF-8 as originally designed could code codepoint values up to 2^31, UTF-16's surrogate-pair setup can only reach about 2^20 codepoints beyond the BMP, which is the reason for the codepoint cap at U+10FFFF.


Space taken

UTF-8 uses

  • 1 byte up to U+7F
  • 2 bytes for U+80 through U+7FF
  • 3 bytes for U+800 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF
  • (designed to code higher codepoints with 5-byte and 6-byte sequences, though the current cap means that won't happen)

Note that:

  • All of ASCII (0x00 through 0x7F, control characters included) is coded identically in UTF-8, so ASCII text is itself valid UTF-8 bytes
  • ...also meaning English encoded in UTF-8 is readable enough by humans and other Unicode-ignorant systems, and could be edited in most text editors as long as they leave the non-ASCII bytes alone.
  • All codepoints from U+80 up are coded using purely non-ASCII byte values, and zero bytes do not occur in UTF-8 output (except to code U+0000 itself), meaning that
you can use null-terminated C-style byte strings to store UTF-8 strings
C library string functions won't trip over UTF-8, so you can get away with carrying strings that way
...just be aware that Unicode-ignorant string altering may do bad things
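
As a quick check of those byte counts (a small Python loop; the boundary codepoints are taken from the list above):

for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X} -> {len(chr(cp).encode('utf-8'))} bytes")
# prints 1, 2, 2, 3, 3, 4, 4 bytes respectively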


UTF-16 uses

  • 2 bytes for U+0000 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF (by using surrogates in pairs; surrogates are codepoints from the U+D800–U+DFFF range that carry no meaning on their own, and in pairs refer to >U+FFFF characters when decoding UTF-16)


UTF-32 stores all characters under the current cap (U+10FFFF) unencoded, making it superficially equivalent to UCS4. (It isn't quite: UCS is a storage-agnostic enumeration, UTF is a byte coding with unambiguous endianness.)


If you want to minimize stored size/transmitted bandwidth, you can choose between UTF-8 or UTF-16 based on the characters you'll mostly encode.

the tl;dr is that

UTF-8 is shortest when you store mostly western alphabetics
because characters under U+800 are coded with at most 2 bytes, and most western-ish alphabetic languages almost exclusively use codepoints under that - and then mostly ASCII, coded in a single byte, so this will typically average less than 2 bytes per character.
while UTF-16 is sometimes better for CJK and mixed content
UTF-16 uses exactly 2 bytes for everything in the BMP (up to U+FFFF), and 4 bytes for everything above that point,
so in theory it'll average somewhere between 2 and 4 bytes per character,
and in practice the most common CJK characters are in the U+4E00..U+9FFF range, so you may find it's closer to 2 -- while in UTF-8 most of these take 3 or 4 bytes.
((verify) exactly how UTF-8 and UTF-16 practically compare for CJK)
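
As a rough illustration of that comparison (a small Python sketch; the sample strings are just mine, not anything authoritative):

samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "greek":   "αβγδε ζηθικ λμνξο πρστυ",
    "chinese": "汉字的编码长度取决于所用的编码方式",
}
for name, text in samples.items():
    print(name, len(text), "chars,",
          len(text.encode("utf-8")), "bytes in UTF-8,",
          len(text.encode("utf-16-le")), "bytes in UTF-16")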

More technical notes

Unicode General Category

Unicode states that each codepoint has a general category (a separate concept from bidirectional category, which is often less interesting)


The general class is capitalized, the detail is in an added lowercase letter.


Letters (L):

  • Lu: Uppercase
  • Ll: Lowercase
  • Lt: Titlecase
  • Lm: Modifier
  • Lo: Other

Numbers (N):

  • Nd: Decimal digit
  • Nl: Letter (e.g. roman numerals like Ⅷ)
  • No: Other (funky scripts, but also subscript numbers, bracketed, etc)

Symbols (S):

  • Sm: Math
  • Sc: Currency
  • Sk: Modifier
  • So: Other

Marks (M):

  • Mn: Nonspacing mark
  • Mc: Spacing Combining mark
  • Me: Enclosing mark

Punctuation (P):

  • Pc: Connector
  • Pd: Dash
  • Ps: Open
  • Pe: Close
  • Pi: Initial quote (may behave like Ps or Pe depending on usage)
  • Pf: Final quote (may behave like Ps or Pe depending on usage)
  • Po: Other

Separators (Z):

  • Zs: Space separator,
  • Zl: Line separator,
  • Zp: Paragraph separator,

Other (C):

  • Cc: Control
  • Cf: Format
  • Cs: Surrogate
  • Co: Private Use
  • Cn: Not Assigned (no characters in the unicode.org list have this category, but implementations are expected to return it for every codepoint not in that list)
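
You can query these categories from most Unicode libraries; for example, Python's standard unicodedata module (a quick illustration, with example output in the comments):

import unicodedata
for ch in ("a", "A", "5", "Ⅷ", "€", "\u0301", " "):
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch, '<no name>')}")
# e.g. U+0061 Ll LATIN SMALL LETTER A,  U+2167 Nl ROMAN NUMERAL EIGHT,
#      U+20AC Sc EURO SIGN,  U+0301 Mn COMBINING ACUTE ACCENT,  U+0020 Zs SPACE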


Pay attention to unicode version, since there are continuous additions and corrections. For example, older versions may tell you all non-BMP characters are Cn. I believe Pi and Pf were added later, so beware of testing for too few lowercase variations.


On modifiers and marks: TODO

Normal forms

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Unicode's normal forms allow halfway-smart comparison and conversion between things that can be considered semantically and/or visually equivalent.


On equivalence

Unicode defines two types of equivalence:


Canonical equivalence is the stricter, semantic kind, used to test things that may show up in more than one way. (This usually also implies both will be rendered identically.)


Consider é. It can show up in both of the following ways:

  • é is a LATIN SMALL LETTER E WITH ACUTE (U+00E9)
  • é looks the same but is actually a LATIN SMALL LETTER E (U+0065) followed by a COMBINING ACUTE ACCENT (U+0301)

These two are canonically equivalent, which roughly speaking means they test as the same thing, so are interchangeable as far as this test is concerned.


Now consider the roman numeral Ⅳ (as a single character, U+2163).

You may well want to consider it the same as the two-character string "IV", but only as a matter of convenience - not in a "these are interchangeable" way. This fuzzier matching is what compatibility equivalence covers.


Note that:

  • since compatibility equivalence is more of a fuzzy 'does this look about the same', anything that is canonically equivalent is implicitly also compatibility equivalent.
Not the other way around, obviously, which leads to the following point
  • canonical mappings are reversible (composition undoes decomposition), while compatibility decomposition is one-way.
It is, for example, not at all a general truth that every sequence "IV" should become the roman numeral.



On conversions; normalization

Considering the above, and the following Python code (results shown as comparisons):

import unicodedata
unicodedata.normalize("NFD", "\u00e9") == "e\u0301"   # é decomposes into e + combining acute
unicodedata.normalize("NFC", "e\u0301") == "\u00e9"   # ...and that pair composes back into é

These examples use only canonical equivalence, which basically means they retain all semantic information. This is why it's useful for tasks like storing in a directly comparable way, for sorting (think collations), for text indexing, operations like removing accents, etc.


What they actually do is

  • NFD (Normalization Form Canonical Decomposition) decomposes only to canonically equivalent forms. Examples:
Roman numeral character Ⅳ stays Ⅳ
e-acute becomes a separate e and combining acute
  • NFC (Normalization Form Canonical Composition) decomposes canonically, then composes canonically. Examples:
Roman numeral character Ⅳ stays Ⅳ
a separate e and combining acute become e-acute


There is also value to mixing in compatibility equivalence.

Consider, e.g., that you know a few people type Ⅳ but most type IV. You may not want to alter the document itself with the compatibility-based transform - it may be (though isn't always) destructive in the semantic sense - but it is a good transformation while building a search index, as it makes it likelier you'll find both while searching with either term.

Not all possible combinations of conversions make sense. In particular, there is no compatibility composition - it would create far too many semantically nonsensical combinations.

What you get is:

  • NFKD (Normalization Form Compatibility Decomposition) decomposes to compatibility-equivalent forms. Examples:
Roman numeral character Ⅳ becomes the two-letter "IV"
e-acute becomes a separate e and combining acute
  • NFKC (Normalization Form Compatibility Composition) decomposes to compatibility-equivalent forms, then recomposes canonically. Examples:
Roman numeral character Ⅳ becomes "IV", then stays "IV"
a separate e and combining acute become e-acute
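
In the same Python terms as the earlier example (a small sketch; expected results in the comments):

import unicodedata
unicodedata.normalize("NFKD", "Ⅳ")          # 'IV'
unicodedata.normalize("NFKD", "\u00e9")      # 'e\u0301'
unicodedata.normalize("NFKC", "e\u0301Ⅳ")   # 'éIV'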


Notes:

  • The names are a little confusing, yes. The D forms (NFD, NFKD) just decompose, while the C forms (NFC, NFKC) decompose and then recompose.
  • C is always a two-step process (decomposition, then composition).
The difference between NFC and NFKC is whether that first step, the decomposition, is canonical or compatibility.



More notes

Note that even NFKD does not do all the fuzziness you might want for searching.

For example, "Röyksopp" would be indexed as "Royksopp", and as long as you apply the same transformation to document and query, you'll match both with both.

...but not "Røyksopp", because there is no linguistic/semantic reason to make ø equivalent to o-and-something. (It is considered in the confuseable set, but you may not want to count on that)


Also, only cases of clear, not-so-arguable equivalence are used; some cases you might expect are not there. For example, the German ß (eszett) is not equivalent to 'ss'.

Planes

Currently, the range open for plane and character definitions is capped at U+10FFFF, which works out to seventeen planes (numbered 0 through 16) of 65536 (0x10000) codepoints each.

  • BMP (Plane 0, the Basic Multilingual Plane) stores scripts for most any live language. (most of U+0000 to U+FFFF). UCS2 was created to code just this plane.
  • SMP (Plane 1, the Supplementary Multilingual Plane) mostly used for historic scripts (ranges inside U+10000 to U+1DFFF)
  • SIP (Plane 2, the Supplementary Ideographic Plane) stores further Han/CJK ideographs (ranges inside U+20000 to U+2FFFF)
  • SSP (Plane 14, the Supplementary Special-purpose Plane) contains a few nongraphical things, like the language tag characters (a range in U+E0000 to U+EFFFF)
  • Planes 3 to 13 are currently undefined.
Plane 3 has tentatively been called the Tertiary Ideographic Plane, but (as of this writing) has only tentatively allocated characters
  • Plane 15 (U+F0000–U+FFFFF) and plane 16 (U+100000–U+10FFFF) are Private Use Areas A and B, which you can use for your own font design; not portable, at least in the semantic sense.
(There is also a private use range in BMP, U+E000 to U+F8FF)

Byte Order Marker (BOM)

In storage you often deal with bytes.


UTF-16 is 16-bit ints stored into bytes, so endianness becomes a thing. It's useful to standardize a file to always be a specific endianness, but you don't have to: you can optionally start each UTF-16 string with a Byte Order Marker (BOM).

The BOM is the character U+FEFF, which once it is bytes works out as either:

  • FE FF: means big-endian UTF-16 (UTF-16BE)
  • FF FE: means little-endian UTF-16 (UTF-16LE)

(U+FFFE is defined never to be a character, so that the above test always makes sense)


UTF-32 also has BOMs (though they are rarely used)(verify):

  • 00 00 FE FF: big-endian UTF-32
  • FF FE 00 00: little-endian UTF-32

Note that if you don't know whether data is UTF-32 or UTF-16 (which is bad practice, really), the little-endian UTF-32 BOM is indistinguishable from a little-endian UTF-16 BOM in a string that starts with a NUL (U+0000).


UTF-8 is a byte coding, so there is only one order it can have in byte storage.

You do occasionally find the BOM in UTF-8 (in encoded form, as EF BB BF). This has no bearing on byte order since UTF-8 is a byte-level coding already. Its use in UTF-8 is mostly as a signature, also to make a clearer case for BOM detectors, though it is usually pointless, and accordingly rare in practice.



The BOM codepoint was previously also usable as a zero-width non-breaking space, so could also appear in the middle of a string. To separate these functions, that use of the BOM is deprecated, and the word joiner (U+2060) should be used instead.


Because BOMs are not required to be handled by all Unicode parsing, it is conventional that:

  • if a file/protocol is defined to be a specific endianness, it will generally not contain the BOM (and the definition sometimes prohibits its use)
  • the BOM character is typically removed by UTF-16 readers.
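
The BOM check itself amounts to a prefix test on the bytes (a minimal Python sketch; real-world encoding detection does considerably more):

def sniff_bom(data):
    # order matters: the UTF-32 LE BOM starts with the UTF-16 LE BOM
    if data.startswith(b"\x00\x00\xfe\xff"): return "utf-32-be"
    if data.startswith(b"\xff\xfe\x00\x00"): return "utf-32-le"
    if data.startswith(b"\xfe\xff"):         return "utf-16-be"
    if data.startswith(b"\xff\xfe"):         return "utf-16-le"
    if data.startswith(b"\xef\xbb\xbf"):     return "utf-8"      # signature use only
    return None

print(sniff_bom("hi".encode("utf-16")))   # typically 'utf-16-le' (Python's utf-16 codec prepends a BOM)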

Other Unicode encodings

Since Unicode is its own definition and relatively new, the encodings that can represent all of it are mostly those created after (and for) it, i.e. the UTFs. One exception is GB 18030, a Chinese standard which can encode all Unicode planes(verify).


Aside from UTF-8, UTF-16, arguably UTF-32, and GB18030, most of these serve a specific purpose, including:


Compressed formats:

  • SCSU (Standard Compression Scheme for Unicode) is a coding that compresses Unicode text, particularly text that stays within one or a few blocks.
  • BOCU-1 (Binary Ordered Compression for Unicode) is derived from SCSU

These have limited applicability. They are not too useful for many web uses, since for content larger than a few kilobytes, generic compression algorithms compress better.

SCSU cannot be used in email messages because it allows characters that should not appear inside MIME. BOCU-1 could, but would require wide client support and is unlikely to overtake UTF.


Codings for legacy compatibility:

  • UTF-7 does not use the highest bit and is safe for ancient 7-bit communication lines
  • UTF-EBCDIC


(TODO: look at:)

  • CESU-8
  • Punycode

More practical notes

Note: Some of this is very much my own summarization, and may not apply to the contexts, systems, pages, countries and such most relevant to you.


On Asian characters

Beyond plain visible glyphs

Non-characters

There are 66 codepoints officially labeled noncharacters, which aren't invalid but also don't code anything.

This includes the last two codepoints in each plane (e.g. U+FFFE and U+FFFF in the BMP), and U+FDD0..U+FDEF (a range falling within Arabic Presentation Forms-A); they are effectively for internal use.


Non-visible characters

There are a lot of other things that are allocated codepoints but are not, or do not by themselves contribute to, visible glyphs. This is not a Unicode definition, just my own list.


  • Control characters (definitions from before Unicode)
    • 'C0 range', referring to the ASCII control characters U+00 to U+1F
    • 'C1 range', U+80-U+9F.
    • U+7F (DEL) is usually also considered a control character.
  • Surrogates (U+D800 to U+DFFF, in two blocks)
in that, seen in isolation, they code no character.
Their real purpose is to allow coding higher-than-BMP codepoints in UTF-16 (including the various UTF-16-based Unicode implementations)


  • Variations
    • Variation Selectors, U+FE00 to U+FE0F - used to select standardized variants of some mathematical symbols, emoji, CJK ideographs, and 'Phags-pa [2]
    • Variation Selectors Supplement, U+E0100 to U+E01EF [3]; see the Ideographic Variation Database[4][5]
short summary: emoji use VS16 (and sometimes VS15 to force non-emoji representation); non-CJK sometimes uses VS1; CJK uses VS1, VS2, VS3; VS4 through VS14 are not in use (as of Unicode 11)
  • Tags (the U+E0000 range)
    • Originally intended to tag languages in text (see also RFC 2482); this was deprecated from Unicode 5 on
    • Unicode 8 and 9 cleared the deprecated status (for all but U+E0001) to allow some other use in the future
https://en.wikipedia.org/wiki/Tags_(Unicode_block)


Combining characters - visible, but need another character to go on


See also:

Double-up encoding: Â, Ã, etc.

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

When you take text that was already encoded into bytes, treat those bytes as codepoints (which happen to all be in the U+00..U+FF range), and then encode that using UTF-8 (or something else), you get mangled data.


The two most common cases are probably UTF-8 encoding Latin1/WinLatin1 byte data, and UTF-8 encoding UTF-8 byte data.

You'll often see character sequences starting with Â (U+C2) or Ã (U+C3), followed by some other accented latin character or symbol. This is because most western codepoints lie in the U+0080 through U+07FF range (and mostly at the start of that), which encodes to two-byte UTF-8 sequences whose first byte is 0xC2 through 0xDF - values which, read as a codepage, are mostly recognizable latin-ish characters (Â, Ã, Ä, Å, Æ, Ç, È, É, ..., ß).

Other sequences are longer. For example, a quotation mark, when it gets rewritten (by some software) to U+2019 (’), will show up as ’ when double-UTF-8'd.


In a good number of cases you can un-mangle this. It involves some recognition of what the first encoding was, though.
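
The round trip is easy to demonstrate, and to undo (a Python sketch; this assumes the wrong decode step was cp1252, which is a common case):

s = "é’"
mangled = s.encode("utf-8").decode("cp1252")      # 'Ã©â€™' - what ends up displayed
fixed = mangled.encode("cp1252").decode("utf-8")  # back to 'é’'
print(mangled, fixed)
# caveat: a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in cp1252,
# so this doesn't always round-trip cleanly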


Emoji

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Types of emoticons/ emoji

Short history:

  • Emoticons/emoji originally mostly referred to combinations of ASCII, initially simple smilies like
    :)


  • Some more local ones, like Japanese kaomoji which started as ASCII like (*_*) (also seen in the west) but now have variants like (´。• ω •。`), Korean style (like ㅇㅅㅇ), down to patterns just seen in one place more than another like ó_ò in brazil, ಠ_ಠ initially on forums, etc.
  • some sites (mostly forums) would replace specific text with images - often mostly the simpler ASCII smilies, but some more extensive
  • The same sites/forums are also known for using shortcodes like
    :bacon:
    to replace with images
- more flexible in subject, but restricted to whatever was in the site's list, so less widespread


  • Later, a decent number were added to Unicode.


Unicode emoji made things a little more interesting.

Platforms can do image replacement of the unicode characters, and various do, each with their own style.

This initially was more often HTML-level replacement of a character with an img tag.

These days the image replacement is often more integrated, done by the app or even the browser(verify) - you can consider this more native support of emoji, as they also show up fully stylized in text input fields.

The same browsers/apps will also often provide decent emoji keyboards/palettes to get them in there in the first place.

This kind of support is a fairly recent development (see e.g. http://caniemoji.com/), and getting pretty decent.

More on image replacement

Apps can choose to replace creative-use-of-characters emoticons, shortcodes, and/or Unicode emoji. More recently this has mostly shifted to Unicode emoji.


Apps often have their own set of replacement images. The number of characters any of these replace varies.


There are also a good number of apps that either use the platform-native set, or specifically adopted one of these.


Emoji sets include:

Google [6] [7]
Microsoft [8]
Apple [9]
Samsung [10]
HTC [11]
LG [12]
WhatsApp [13]
Twitter [14], also used by e.g. Discord
Snapchat [15]
Facebook [16]
Mozilla [17]
EmojiOne [18] - licensable set used e.g. by Discourse, Slack
GitHub [19]
GMail - had a smaller set (used in subject lines, often square-looking)

Emoji according to unicode

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

While a bunch of emoticon-like characters have been around for a long time, the first bulk introduction of emoji characters came in Unicode 6.0 (~2010).

Standardized Emoji data came later (~2015) and has its own versioning (1, 2, 3, 4, 5, 11, 12 - the version jump comes from a choice to synchronize Emoji version with Unicode version).

Emoji 1 is a relatively simple set of characters. Emoji 2 (which came soon after) introduced sequences and a few modifiers (mainly skin tone), and versions since then have mostly just expanded on characters, sequences, and modifiers.

We're currently at ~1200 emoji, not counting modifiers or the sequences.


Keep in mind that some codepoints are emoji only, meaning that the bulk of emoji come from a few dedicated ranges added in the last decade. This refers mostly to:


But a number of emoji had been around for a while (e.g. ☹ ☺ ☻ since Unicode 1), and a bunch of regular symbols mentioned in the emoji data are now defined to have an emoji style (e.g. ❤), which includes a few regions with mostly symbols, including:


And there are a handful of other characters with emoji styles in other ranges

  • a few from Basic Latin
  • a few from General Punctuation
  • a few from Letterlike Symbols
  • a few from CJK Symbols and Punctuation
  • and various others (TODO: complete this list)


And a few special cases, like how flags are coded using the Enclosed Alphanumeric Supplement.


If you consider the characters used in kaomoji (which Unicode wouldn't standardize because this is creative, not semantic use) and similar, there are at least hundreds more characters involved.


See also:


Flags

Flags do not have singular codepoints - this would be annoying

Instead, U+1F1E6 through U+1F1FF (from the Enclosed Alphanumeric Supplement) are A-Z with the special meaning of "interpret pairs of these as an ISO 3166-1 alpha-2 code[20]"

So, for example, 🇧🇪 is actually the sequence of

U+1F1E7 REGIONAL INDICATOR SYMBOL LETTER B
U+1F1EA REGIONAL INDICATOR SYMBOL LETTER E

Doing it this way means changes in countries are a matter of parsing, not of definitions within Unicode.
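
So building a flag is just arithmetic on the country code (a small Python sketch):

def flag(cc):
    # map A-Z onto the regional indicator symbols U+1F1E6..U+1F1FF
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in cc.upper())

print(flag("BE"))   # 🇧🇪
print(flag("NL"))   # 🇳🇱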


https://en.wikipedia.org/wiki/Enclosed_Alphanumeric_Supplement

On variation selectors

Some things are defined as emoji, and are likely to be shown as such everywhere.

There are also emoji-like representations of more regular letters and symbols that have existed much longer, like ❤ (and even a few fairly basic ASCII characters can be shown emoji-style, like #️*️0️1️2️3️4️5️6️7️8️9️).

That means you'll want the ability to force some things to emoji style - or to text style.

This is done with U+FE0F (Variation Selector 16) and U+FE0E (Variation Selector 15) respectively. These should be placed immediately after the character.

(variation selectors have existed as codepoints since Unicode 3.2, but were only really introduced in this use in Emoji 11 (2018)(verify))

VS16 is useful when the character might default to text representation. For example:

  • ✅️ and ✅︎ are the same character (U+2705) followed by VS16 and VS15.
  • VS16 is used in Unicode's own emoji-data.txt
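
Appending the selector is all there is to it (a quick Python illustration):

check = "\u2705"            # WHITE HEAVY CHECK MARK
print(check + "\ufe0f")     # with VS16: emoji presentation
print(check + "\ufe0e")     # with VS15: text presentation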

On modifiers, and ZWJ Sequences

Unicode also made a point of allowing specific combining characters to modify emoji, altering looks, and sometimes meaning.


A common example is to alter skin tone. For example,

U+1F474  OLDER MAN                           👴
U+1F3FF  EMOJI MODIFIER FITZPATRICK TYPE-6   🏿

you can get a dark-skinned older man 👴🏿. (Fitzpatrick refers to this scale)


Note that U+1F3FF is considered a modifier, so can be used directly.

In many other cases you are combining characters that would render perfectly well by themselves, so you'd want a way to keep them as an isolated sequence, and a way to say "combine these in a semantic-and-unbreakable way". The latter is why we have the ZWJ.


For example, a female school teacher 👩‍🏫 would come from

U+1F469   WOMAN                    👩
 U+200D   ZERO WIDTH JOINER 
U+1F3EB   SCHOOL                   🏫

And a 👨‍❤️‍💋‍👨 kissing couple may come from

U+1F468   MAN                      👨
 U+200D   ZERO WIDTH JOINER        
 U+2764   HEAVY BLACK HEART        ❤
 U+FE0F   VARIATION SELECTOR-16    
 U+200D   ZERO WIDTH JOINER        
U+1F48B   KISS MARK                💋
 U+200D   ZERO WIDTH JOINER 
U+1F468   MAN                      👨
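
In code these are just the codepoints in order (a Python sketch; how it renders depends on the platform's emoji support):

teacher = "\U0001F469\u200D\U0001F3EB"   # WOMAN + ZWJ + SCHOOL
print(teacher)                            # 👩‍🏫 where supported, 👩🏫 as a fallback
print(len(teacher))                       # 3 codepoints, though it reads as one emoji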


ZWJ sequences only really exist as predefined things; see e.g. emoji-zwj-sequences.txt, which so far mostly defines families, a few activities, specific person-role combinations (mostly professions), and specific types of cats.

But not all combinations following the patterns in there exist - e.g. person+school did not (at the time of writing) seem to give a gender-neutral teacher - and certainly not everything you could think of exists.


Note that

  • VS16 is occasionally necessary in ZWJ sequences, here because the heavy black heart may default to text style and break the sequence(verify).
it seems that in practice, its necessity, or its emission from emoji palettes, is not always consistent


Things that may be strange and excitingly unpredictable to you (as is usual in unicode in general):

  • combined width
  • how many glyphs something gets rendered as
  • line breaking within sequences
  • comparing emoji sequences

Or, according to some,

(╯°□°)╯︵ ┻━┻


See also

See also:

https://blog.emojipedia.org/emoji-zwj-sequences-three-letters-many-possibilities/

Large-set encodings

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Unicode

  • UTF8
  • UTF16


Chinese:

  • GB2312
    • extended by GBK (which exists in more than one version) and later by GB 18030
    • Covers most of simplified Chinese
    • Note: Windows' Code Page 936 was mostly GBK, and effectively transitional


http://www.yale.edu/chinesemac/pages/character_sets.html

Unicode character input

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Unices

Windows

charmap

Utilities like charmap can be a nice visual way to find specific characters, though many of these utilities are not that easy to use.


Alt and decimal codepoint

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Hold Alt, enter some numpad numbers, release Alt.


These are what people usually mean by 'Alt codes'. A good number of people know about this, but not the behaviour behind it, so tend to remember a few specific codes at best. It's still around because of such use.


This seems to be part of Windows' form behaviour, so works in most applications' text fields (it won't work in applications that have added/replaced their own input handling, and you can't always guess which(verify)).

Has been copied by some other OS GUIs / input methods.


Without an initial zero you pick decimal numbers from the OEM codepage (often cp437 or cp850). Which it is seems to have varied over time (and possibly with localisation?).


When you start with a zero you pick decimal from codepage 1252(verify). For example, Alt-0228 points to 228 (decimal) in that codepage, ä.


Since this is codepage stuff, it stops at Alt-255 / Alt-0255. You can't use it for Unicode in general.


Alt and hex codepoint

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Must be enabled to work, by setting EnableHexNumpad to 1 in HKCU\Control Panel\Input Method in the registry, and logging in again.


Hold Alt, press numpad +, type the hex digits (numpad numbers and a-f), release Alt.

Seems to allow anything from U+0000 to U+FFFF


MS Word key sequence

For example, Ctrl+' followed by e produces é.


Again, this is imitated in other places.

Alt-x, hex codepoint, Alt-x

Works in a few specific apps.


An IME

See also

http://www.fileformat.info/tip/microsoft/enter_unicode.htm

http://en.wikipedia.org/wiki/Unicode_input

See also



Character references:


UTF:

Unicode-related encodings (wikipedia pages):


Other: