Text coding
{{info|I am ignoring a bunch of history that I am aware of, and probably ''more'' that I am not aware of


The point here is not completeness, but
: to give you some context,
: a "this is how we used to do it, and this is why we (mostly) stopped doing it that way",
: ...and then skip to a view for modern programmers, probably the only people who would really care much about these details to start with.


If you're interested in the details, there are better and longer histories, written by more knowledgeable people.
}}
<!--






Computers deal with data as numbers. At the lowest levels, only.
(In movies we pretend it's ones and zeroes, but even at assembly level, ''most'' instructions are already a level above that, thinking in [[integers]])


The simplest way to represent text is to have a sequence of numbers, where each number corresponds to a character.

So text is often an array of integers-that-are-characters.
We tend to call that a string, but under the covers (and in more classical languages, fairly directly),
it's not much different from any other array.


...an array of elements of whatever size was natural in context.
:: influenced by the computer's [[word size]] (CPU, memory bus), by serial lines (think teletypes), and by punch cards (that's the era we're talking about)
:: In C, strings of characters were 8-bit things.

While bytes weren't always 8 bits, we can assume that here, and e.g. C strings are arrays of 8-bit characters.


In a lot of systems there is a one-to-one relation between a number and the character it represents. That makes life simple.
: ...there are exceptions, because of course there are.
: the actual mapping of which number means which character might differ between systems
:: because of the design of the computer
:: because of word size
:: because of 'not invented here', and more.
:: it could even vary between (or translate from) different designs of input method. Say, in the hole-punch days there were 5- and 6-bit codes.
:: It's useful to know these exist, but it's now rare to find them even mentioned. I'm ignoring these less-standardized times completely. You'll run into the few that survived anyway.
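To make the "text is an array of numbers" view concrete, a quick Python sketch (Python is only an illustration language here, not something this article assumes):

 # a string is, at heart, a sequence of numbers
 text = "Hi!"
 print([ord(c) for c in text])   # [72, 105, 33]
 print(bytes([72, 105, 33]))     # b'Hi!'  (the 8-bit-per-character view of the same thing)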




Once [http://en.wikipedia.org/wiki/ASCII '''ASCII'''] existed, a lot of systems gravitated around it
: ...there were similar attempts at such a standard from the same time, ASCII just happened to be the most long-lived
: ASCII defined characters for values 0..127 of the by-then-typically-8-bit units.
:: 95 of these are printable characters, covering English
===Codepages===

ASCII covered English-speaking countries, but that didn't even cover everyone in the then smallish industry of ''making'' computers, let alone the ones using them.

Having a standard is useful, so we defaulted to ASCII, but half of Europe ''immediately'' had the problem that their alphabets were incomplete.
'''Codepages''' {{comment|(a term apparently coined by someone at IBM. There are other names, and some similar-but-not-identical concepts)}}
define characters for that undefined range.
They may leave the lower range alone entirely, they may redefine some, or sometimes even reassign them aggressively.


 
...which could be unambiguous within specific contexts.
Text printers had some specifically defined alternatives (with some fancy lines/symbols), and could be told to switch, so this functions as an extra predefined font.


You can hear a "but" coming.
'''Codepages are bothersome''', though, for a handful of reasons.


For one, they only give us 256 total characters ''at most''.
Even if we wanted to make all the added definitions into one standard, we wouldn't have the space.


Which led to there being a distinct codepage for every use - each language area, fancy printing, etc.
It turns out any one code page is only good for one or two uses / countries at a time.

And because people like to argue, there was usually more than one codepage for any one specific need.
And quite regularly more than two.




Which wouldn't even be that bad if things were clear, but you had to somehow externally ''know'' which codepage any data was using, both for unambiguous communication and for correct display.

Most documents either had a kludgey way at best, or just no way, of mentioning the ''one'' codepage they use.
We never defined a standard way of doing that, and most of us were fuzzy on how it worked exactly.
So for plain text we're still stuck with best guesses fifty years later.


 
Even at the time it wouldn't have been a total solution just to know the codepage - in the DOS days, the codepage you used was a system-wide thing (presumably because in text mode it was effectively a graphics hardware setting, because you would need to load the right font).
Which meant you could use only one at a time.
Everything in another codepage would not display correctly.
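To see what that guessing problem looks like in practice, a small Python sketch - the byte value alone does not tell you which codepage was meant, so the same byte reads back as different characters under different (assumed) codepages:

 # the same single byte, decoded under a few different codepages
 raw = bytes([0xE9])
 for codepage in ('cp437', 'cp1252', 'koi8_r', 'iso8859_7'):
     print(codepage, raw.decode(codepage))
 # cp437     Θ
 # cp1252    é
 # koi8_r    И
 # iso8859_7 ι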




=Unicode=


==What unicode is==


tl;dr: Unicode:
* Unicode itself is a character set, a map of codepoints to characters.
: intends to code characters, not glyphs (see also [[#What_unicode_isn't|what unicode isn't]], below)

* ...plus a bunch of cases where codepoints ''combine'' to show fewer characters - see also [[#Grapheme_clusters|grapheme clusters]]

* Unicode covers the scripts of all known living languages, many dead ones, and a number of special purposes
: (it's starting to get into [[semiotic]] questions - what sort of signs are understood even if they're not language, what do we do with fictional languages, etc.)

* also, unicode defines a number of different byte codings to use in transmission
: Perhaps unicode's most useful property is the most boring one: that its interpretation does not depend on context or serialization.
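A quick Python sketch of that last point - the codepoints are the abstract part, and the byte codings are separate, interchangeable serializations of the same codepoints (Python is just the illustration language here):

 s = "héllo"
 print([hex(ord(c)) for c in s])       # ['0x68', '0xe9', '0x6c', '0x6c', '0x6f']  - the codepoints
 print(s.encode('utf-8'))              # b'h\xc3\xa9llo'                  - one serialization
 print(s.encode('utf-16-le'))          # b'h\x00\xe9\x00l\x00l\x00o\x00'  - another
 print(s.encode('utf-8').decode('utf-8') == s)   # True - no external context needed to get the text back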




* http://utf8everywhere.org/
-->


==What unicode isn't==
<!--


'''Unicode is (itself) not an encoding''' (as such) - Unicode is a character set first.

It ends up defining a bunch of encodings, but there is no ''singular'' encoding called Unicode,
and saying 'something is encoded in Unicode' does not mean a lot, and is borderline incorrect.
: ...that said, specifically UTF-8 and/or UTF-16 are ''common'' for storage and transfer.


It briefly was, actually. The first version, in 1989, thought that 64K would be enough for anyone.


This was changed fairly quickly, but not before some OSes implemented it as 16-bit things, what you would call UCS2.






-->


==U+ denotation==




Or, what's with this '''equivalence''' and '''decomposition''' stuff?


Unicode's '''normal forms''' are mainly meant to compare whether codepoints (and strings containing them)
mean the same thing semantically,
and/or seem the same thing visually,
even when they are not the same codepoints.


To that end, it
* defines two types of equivalence: canonical equivalence and compatibility equivalence
* lets you decompose characters/strings into what it considers their component parts


As almost a side effect of the last, this is also useful to normalize strings in some specific ways,
sometimes to remove [[diacritics]], and other things -- but because this is not its actual aim,
you should assume Unicode does neither of those things well.


=====Equivalence=====

'''Canonical equivalence''' is mainly used to test things that '''may show up in more than one way and are interchangeable'''.

Interchangeable in the sense of ''semantic'' equivalence: does it ''mean'' the same thing.

This usually also implies both will be rendered identically.


Consider &eacute;. It can show up in both of the following ways:
* &#xE9; is a LATIN SMALL LETTER E WITH ACUTE ({{unicode|00E9}})
OR
* &#x65;&#x301; which looks the same but is actually two codepoints:
** LATIN SMALL LETTER E ({{unicode|0065}}) followed by 
'''Compatibility equivalence''' is for things you might want to reduce to the same thing, but which are not interchangeable.


Consider the roman numeral &#x2163; as a single character, {{unicode|2163}}.

In ''some'' contexts it will be useful to be able to consider it the same as the basic two-character string "IV" - yet you would not want to generally consider the letters IV to be the roman numeral (e.g. SHIV is an all-capsed word, not SH5).




Note that since
: canonically equivalent is "basically the same thing",
: compatibility equivalence is more of a fuzzy 'does this look roughly the same',
...anything that is canonically equivalent is implicitly also compatibility equivalent.

And not the other way around: compatibility equivalence does not imply canonical equivalence.
: It is, for example, not at all a general truth that every sequence of IV should become the roman numeral.






=====Decomposition and composition (and tests using them)=====


Canonical decomposition and canonical composition will only separate and (re)combine things that are semantically equivalent.
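A small Python sketch of the equivalences above, using the standard library's unicodedata module (NFC/NFD are the canonical forms, NFKC/NFKD the compatibility forms):

 import unicodedata
 
 composed   = "\u00e9"        # é as a single codepoint
 decomposed = "e\u0301"       # e plus COMBINING ACUTE ACCENT
 print(composed == decomposed)                                   # False - different codepoints
 print(unicodedata.normalize('NFC', decomposed) == composed)     # True  - canonically equivalent
 print(unicodedata.normalize('NFD', composed) == decomposed)     # True
 
 roman_four = "\u2163"        # Ⅳ, ROMAN NUMERAL FOUR
 print(unicodedata.normalize('NFKC', roman_four))                 # IV   - compatibility equivalent
 print(unicodedata.normalize('NFC', roman_four) == roman_four)    # True - canonical forms leave it alone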




When you want to
* sort more sensibly
* index text more sensibly
* compare semantically and/or fuzzily without doing this transform each time
...then you choose to e.g. run either composition or decomposition over all your text at input time.




There are further operations where normal forms are involved, like
* removing certain accents (not really a designed feature, more of a 'you can often get decently far' - see the sketch below)

* [https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block) halfwidth and fullwidth forms] include a copy of the basic latin alphabet, mostly for lossless conversions between this and older encodings containing both halfwidth and fullwidth characters {{verify}}
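For example, the common decompose-then-drop-combining-marks trick, sketched in Python - it gets you decently far for Latin script, but as said it is not a designed feature, and it does nothing for characters without a decomposition (ø, ß, and so on):

 import unicodedata
 
 def strip_marks(s):
     # decompose, then drop the combining marks (general category Mn)
     decomposed = unicodedata.normalize('NFD', s)
     return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
 
 print(strip_marks("crème brûlée"))   # creme brulee
 print(strip_marks("Måløy"))          # Maløy - the ø has no decomposition, so it stays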
https://unicode.org/reports/tr29/


-->
====Grapheme clusters====


<!--




A [[grapheme]], in general, is a semantically indivisible written unit. If you split it up, it would lose said meaning.


In Unicode, a grapheme might be composed from multiple codepoints, so Unicode calls these 'grapheme clusters',
also to point out that this gets more interesting than 'adding a combining accent mark to a base character'.


This brings in an issue: splitting a string without considering this might break the correctness of what is displayed,
so you want to only split on '''grapheme cluster boundaries''', not on any codepoint.

As such, grapheme clusters are addressed in [[#Unicode_Text_Segmentation|Unicode Text Segmentation]], and matter to string operations like slicing.
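A small Python illustration of why codepoint-level slicing is not grapheme-safe. Python's built-in indexing works on codepoints; full grapheme cluster segmentation (UAX #29) needs a library, here the third-party regex module as an assumed extra dependency:

 s = "ne\u0301e"          # 'née' spelled with a combining acute: n, e, combining acute, e
 print(len(s))            # 4 codepoints, though it displays as 3 graphemes
 print(s[:2])             # 'ne' - the slice cut the accent off its base letter
 
 # grapheme-aware splitting, using the third-party 'regex' module (pip install regex)
 import regex
 print(regex.findall(r'\X', s))   # ['n', 'é', 'e'] - \X matches one grapheme cluster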




In a wider sense, grapheme clusters include
* base characters plus combining marks,
:: {{comment|Side note: Unicode also has a single codepoint -- precomposed forms -- for some common character-plus-accent combinations. The point seems to be backwards compatibility, but note this is not strictly necessary and arguably just confusing, because it makes for multiple ways to show the same grapheme. They do avoid the 'arbitrary codepoint slice will break the grapheme' issue, though}}
* surrogate pairs
* Hangul jamo
...and more
https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html




"Is there a complete list of grapheme clusters?"
"Is there a complete list of grapheme clusters?"


Not really. Much of emoji happens to be defined in a list (zwj sequences),
Not really.  
but not all, and in theory, you can add combining marks on anything, an̯͏d̴̲̑ ̨͏͏̊︣ą̵͚̥͜n̠̭̟̠̞̊͒᷅̃ý̵̖̱̠̦̑͘ ̴̴͍̗͎̖᷁͟a̻͔︠͜ḿ̷̢͕̝̤̖o︢u̵͒̌̀᷇n̵̸̗̄t̖ ̛̜o̫f͓̓ ̋t̵̤i̵᷿̐̂m̢͎̝̮᷀︢e̹⃰̃̍̾s̴̮̲̝̣͏͗̆᷉̐.
Some sets are fairly closed, e.g. both single-codepoint and zwj-sequence emoji.
But others not so much,  
and in theory, you can add combining marks on anything, an̯͏d ̨͏͏ą̵͚̥͜n̠̭̟̠ý̵̖̱̠̦̑͘ ̴̴͍̗͎aḿ̷̢͕̝̤̖ou̵͒̌n̵̸̗̄t̖ ̛̜o̫f͓̓ ̋t̵̤i̵᷿̐̂m̢͎̝̮᷀︢e̹⃰̃̍̾s̴̮̲̝̣͏͗̆᷉̐.
 
 








legacy grapheme clusters
extended grapheme cluster




For example, this includes U+093F ( ि ) DEVANAGARI VOWEL SIGN I. The extended grapheme clusters should be used in implementations in preference to legacy grapheme clusters, because they provide better results for Indic scripts such as Tamil or Devanagari in which editing by orthographic syllable is typically preferred. For scripts such as Thai, Lao, and certain other Southeast Asian scripts, editing by visual unit is typically preferred, so for those scripts the behavior of extended grapheme clusters is similar to (but not identical to) the behavior of legacy grapheme clusters.


Do grapheme clusters


-->


====Other Unicode encodings====
Unicode was its own thing when introduced, and the web now largely uses UTF-8.

There are other encodings, mostly created after Unicode was introduced,
that can also encode Unicode characters, not least of which is GB 18030,
a Chinese standard which can encode all Unicode planes {{verify}}.


If you consider a fuller list (and excluding UTF-8, UTF-16, arguably UTF-32, and GB18030),
then most seem to serve a specific purpose, including:
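As a quick check of that GB 18030 claim (a sketch, not a proof - Python happens to ship a gb18030 codec):

 samples = "Ελληνικά, 漢字, and \U0001F9C0"   # Greek, CJK, and a codepoint outside the BMP
 encoded = samples.encode('gb18030')
 print(encoded.decode('gb18030') == samples)          # True - round-trips through GB 18030
 print(len(encoded), len(samples.encode('utf-8')))    # the byte counts differ, the text does not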


Compressed formats:
* SCSU [https://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode] (Standard Compression Scheme for Unicode) is a coding that compresses Unicode, particularly if it only uses one or a few blocks.
* BOCU-1 [https://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_Unicode] (Binary Ordered Compression for Unicode) is derived from SCSU


These have limited applicability.
They are not too useful for many web uses, since for content larger than a few kilobytes, generic compression algorithms compress better.


SCSU cannot be used in email messages because it allows characters that may not appear inside MIME.
BOCU-1 ''could'', but won't until clients widely decide to support it, and they have little reason to these days.




Codings for legacy compatibility:
* UTF-7 [https://en.wikipedia.org/wiki/UTF-7] does not use the highest bit and is safe for (ancient) 7-bit communication lines
* UTF-EBCDIC [https://en.wikipedia.org/wiki/UTF-EBCDIC]




(TODO: look at:)
* CESU-8 [https://en.wikipedia.org/wiki/CESU-8]
* Punycode [https://en.wikipedia.org/wiki/Punycode]


==More practical notes==


-->


====On Asian characters====

Because Chinese, Japanese, and Korean writing uses [[logograms]], there are a ''lot'' of characters.


Additionally, there is a lot of overlap in the history and meaning of these characters in these languages,
and sometimes divergence.

Because they are tied to a degree, there are some discussions specific to the CJK set (sometimes CJKV, also including Vietnamese).


When we had only the BMP, there was space for approx 28K such characters - an amount that is now all filled.

It was later extended with the Supplemental Ideographic Plane (SIP), space for another 47K characters.
: Note SIP does not lie within UCS2, which can be a problem if your program uses an ''old'' UCS2 implementation of unicode (2-byte unicode can still work if it's actually UTF-16, not UCS2)




There are also... less-technical issues.


For context, historically characters were adopted largely from Chinese into the others mentioned, but also some in the other direction.


Much of that adoption was so long ago that the language has its own writing style and some characters may have since changed in meaning.

It turns out there is a large set that is the same or ''close'', and a small set that is distinct,
so Unicode decided that if a character carries the same meaning as its origins, and differs ''only'' in writing style (plus some further requirements), then it is effectively still the same in the three languages, and is coded as a single character (via Unihan, {{comment|(where 'Han' is a reference to chinese origin, to [http://en.wikipedia.org/wiki/Hanzi hanzi] in Chinese, [http://en.wikipedia.org/wiki/Kanji kanji] in Japanese, [http://en.wikipedia.org/wiki/Hanja hanja] in Korean, [http://en.wikipedia.org/wiki/Chu_han chu han] in Vietnamese.)}}), and if it doesn't meet those requirements, it is coded as separate characters.




This makes sense from Unicode's point of view,
as it focuses on semantics,
and on storing things,
but has been controversial in a sociological, cultural, and sometimes practical view.




...but no single font can view all at the same time, so you'd need multi-font typesetting and language marking to actively mix them within the same document. {{comment|(''Technically'' speaking, Unicode allows language tagging using Unicode codepoints, but also discourages it in general use, and it's unlikely fonts could and would really deal with this. This is more a detail for typesetters and web developers working on international sites, though)}}.
-->






When you restrict yourself to countries, things stay relatively simple. Sort of.


Unicode 6 added a range of letters used only as regional indicators (U+1F1E6 through U+1F1FF, within [https://en.wikipedia.org/wiki/Enclosed_Alphanumeric_Supplement Enclosed Alphanumeric Supplement]).


It comes down to "use these in pairs to refer to an [[Language_codes,_country_codes#Countries|ISO 3166-1 alpha-2 code]]"[https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements].
: (spoilers: except exceptions)


So, for example, &#x1F1E7;&#x1F1EA; is actually the sequence of
  U+1F1E7  REGIONAL INDICATOR SYMBOL LETTER B
  U+1F1EA  REGIONAL INDICATOR SYMBOL LETTER E


Doing it this way means
* there's just one authoritative place to keep up to date.
* changes in countries are up to implementations following changes to ISO 3166, not definition within Unicode.
* ...flags can actually show up differently in a lot of places.
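A minimal Python sketch of that pairing rule - map each letter of an ISO 3166-1 alpha-2 code onto the regional indicator range (whether you then see a flag, or just two small letters, is up to the font and platform):

 # REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6; B follows it, and so on
 def flag(iso3166_alpha2):
     return ''.join(chr(0x1F1E6 + ord(letter) - ord('A'))
                    for letter in iso3166_alpha2.upper())
 
 print(flag('BE'))   # 🇧🇪
 print(flag('GB'))   # 🇬🇧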




One practical question is "where do you stop with regions?"


The above is ''an'' answer, but not a fully satisfying one.
 
Consider the UK.  
<!--
The UK is complex to start with.
 
And it is also funny in the ISO 3166-1 alpha-2 code list.
 
Sure, GB is the United Kingdom of Great Britain and Northern Ireland, and its flag is &#x1F1EC;&#x1F1E7;
 
 
But this is one example where the parts of that state are self-governing enough that they ''certainly'' use their own flag, which they also had before they joined.  


England, Scotland, and Wales are not countries according to that ISO list -- but the [https://en.wikipedia.org/wiki/Isle_of_Man Isle of Man] ''is'' (IM, &#x1F1EE;&#x1F1F2;). But e.g. [https://en.wikipedia.org/wiki/Tristan_da_Cunha Tristan da Cunha] is ''not'' (but TA &#x1F1F9;&#x1F1E6; works, despite that merely being ''reserved'', not formal in that ISO{{verify}}). England, Scotland, and Wales may find that unfair, and the footnotes in variations of types of "region that the UK is responsible for (because it is)" will ''certainly'' be lost on people like me, but probably also many brits.

{{comment|(As a side note, ISO 3166-1 alpha-2 makes GB the preferred choice and [https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Decoding_table reserves UK] apparently to avoid confusion, but actually, things like ccTLDs and the European Commission, which mostly follow ISO 3166-1 alpha-2, deviate from that by using UK, not GB.)}}
 
(...yet) with unicode, assume that you'll be using GB (&#x1F1EC;&#x1F1E7;), not UK (&#x1F1FA;&#x1F1F0;). The following exceptions also do:


The English, Scottish, and Welsh flags happen to be added in Emoji 5 - so don't use the same mechanism, but specifically registered emoji sequences -- a system which seems to have originated in unofficial use but might maybe become a little more settled?{{verify}}:

The English flag: &#x1F3F4;&#xE0067;&#xE0062;&#xE0065;&#xE006E;&#xE0067;&#xE007F;
  U+1F3F4  WAVING BLACK FLAG
  U+E0067  TAG LATIN SMALL LETTER G
  U+E0062  TAG LATIN SMALL LETTER B
  U+E0065  TAG LATIN SMALL LETTER E
  U+E006E  TAG LATIN SMALL LETTER N
  U+E0067  TAG LATIN SMALL LETTER G
  U+E007F  CANCEL TAG
Scottish: &#x1F3F4;&#xE0067;&#xE0062;&#xE0073;&#xE0063;&#xE0074;&#xE007F;
  U+1F3F4  WAVING BLACK FLAG
  U+E0067  TAG LATIN SMALL LETTER G
  U+E0062  TAG LATIN SMALL LETTER B
  U+E0073  TAG LATIN SMALL LETTER S
  U+E0063  TAG LATIN SMALL LETTER C
  U+E0074  TAG LATIN SMALL LETTER T
  U+E007F  CANCEL TAG
 
Welsh: &#x1F3F4;&#xE0067;&#xE0062;&#xE0077;&#xE006C;&#xE0073;&#xE007F;
  U+1F3F4  WAVING BLACK FLAG
  U+E0067  TAG LATIN SMALL LETTER G
  U+E0062  TAG LATIN SMALL LETTER B
  U+E0077  TAG LATIN SMALL LETTER W
  U+E006C  TAG LATIN SMALL LETTER L
  U+E0073  TAG LATIN SMALL LETTER S
  U+E007F  CANCEL TAG




http://unicode.org/review/pri299/pri299-additional-flags-background.html
-->




 
'''This also runs right into politics.'''
   
   
A Microsoft or an Apple can, say, decide to include the Israeli flag and the Palestinian flag.
Might be a bit contentious depending on where you are at the moment,
but you can defer to some type of neutrality this way.


But what to think about Apple removing the ability to render the Taiwanese flag (TW, &#x1F1F9;&#x1F1FC;) ''only if the locale is mainland China''?
The reason clearly seems to be that the flag can symbolize Taiwanese independence, which the Chinese government is, ahem, not a fan of - but Apple still wants to sell iPhones to a billion chinese people, soooo.... You now probably have opinions on the ethical stance that Apple is not taking there.


Windows seems to have had, and then specifically dropped, flag support.
: Browsers have their own implementation because windows does not.
Which might well be related to such politics.




Unicode just side-stepped such issues in the first place by basically saying "we give you a system, you decide how to use it"


<!--


'''What about macro-regions like EU and the UN?'''


These are actually in the ISO list, but as 'exceptional reservations'[https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Exceptional_reservations].
Not that all of ''those'' have flags, e.g. UK does not.


EU was added in Emoji 1


UN was added in Emoji 4


'''What about non-region flags - the pirate flag? The white flag? A checkered flag? The pride flag?'''


 U+2690 WHITE FLAG
 U+2691 BLACK FLAG

 U+1F3F3 WAVING WHITE FLAG
 U+1F3F4 WAVING BLACK FLAG

 U+1F3C1 CHEQUERED FLAG

The rainbow flag &#x1F3F3;&#xFE0F;&#x200D;&#x1F308; (added in Emoji 4) is:
 U+1F3F3 WAVING WHITE FLAG
 U+FE0F  VARIATION SELECTOR-16
 U+200D  ZERO WIDTH JOINER
 U+1F308 RAINBOW
-->


=Unsorted=


<!--
Many languages write text left to right, others right to left.


Right-to-left often refers to '''right to left, top to bottom'''
: characters are stacked under the last, columns continue left of the last
: Chinese, Japanese, and Korean writing is classically written like this -- though these (computerized) days it is often left-to-right


The '''left-to-right''' order, which is almost always also top-to-bottom, is common.
'''In computers'''


Before unicode, any program might already deal with right-to-left, but would often deal with just rtl ''or'' ltr,
and could not mix them - designing that would be complex, and few applications needed it.


Unicode tried to address that, in that [https://en.wikipedia.org/wiki/Bidirectional_text bidirectional script support]
defines enough about how things should be stored,
and how that should be displayed,
that it becomes much easier to work with.


Mixing them is now rendered for you pretty well -- though there are a lot of sequences that really make no sense to actually do,
(actually doing those will look... interesting, but is arguably irrelevant because aside from hackerthink, no one expects that to do anything useful).


'''In unicode strings, characters should appear in the order they would be written and read/interpreted, not in the direction they would be displayed.'''


To steal wikipedia's example,
  U+05D4 ה HEBREW LETTER HE
in that order.




There are some questions left over, like what the cursor should do when stepping through these characters,
and how things change when you mix.
'''How does it know?'''

''Broadly'', mixing scripts will look more or less right with little to no assistance.
How? The '''bidi algorithm'''.

How, at a lower level?
What we have to work with is, roughly
* unicode defining most characters as either
** '''strongly typed as LTR''' (left-to-right)
** '''strongly typed as RTL''' (right-to-left)
** '''weakly typed'''/neutral - those that could be used in either, such as spaces and a lot of punctuation.
* the content (document, input field, etc) defining a '''base directionality'''
:: e.g. HTML allows specifying a dir=
* (there is actually [https://en.wikipedia.org/wiki/Bidirectional_text#Table_of_possible_BiDi_character_types a bunch more to bidi typing], but we'll start with that)

Bidi mostly splits series of characters into sections of LTR and RTL.
While you can see how this eases rendering later, this is still just a semantic step at this point
{{comment|(it's not actually ''enough'' to define how it should be rendered)}}.

When you have only strongly typed characters, this split is fairly simple to understand.
Weakly typed characters can be used in either, and are assigned a direction according to the bidi algorithm.
''Roughly speaking'', they extend the section of whatever directionality they are sitting on.{{verify}}

...but they will often be sitting between two strongly typed codepoints of different directionality,
so isn't that ambiguous?
Yup. Roughly speaking it extends in the direction of the base directionality.{{verify}}
(not quite how it works, but it can be easier to think of it as extending,
because there are often multiple weak characters, e.g. a comma and a space.)

Consider:
 English space Arabic comma space English
in LTR base directionality might become{{verify}}
 [English space] [Arabic comma space] [English]
in RTL base directionality might become{{verify}}
 [English] [space Arabic] [comma space English]

This is at least well defined, but not necessarily what you wanted, and
while you can direct it, you now have to know what it's doing.
Note that where that comma will eventually be rendered depends on the context.

Note that numbers are handled as weakly typed -- so they become part of what is semantically before them{{verify}}
This is sometimes great, and sometimes messy. To steal the example from [https://www.w3.org/International/articles/inline-bidi-markup/uba-basics], if your text was
 [RTL name] - 5 stars
then (in LTR context) the just-mentioned logic would stick ' – 5' on the RTL text and probably render it as:
 - 5 RTL name stars

'''Forcing directionality'''
* in HTML, it may be more readable to define elements with the sole
* in bare unicode, there are characters to help.
 U+200E LEFT-TO-RIGHT MARK           (LRM)
 U+200F RIGHT-TO-LEFT MARK           (RLM)
 U+202A LEFT-TO-RIGHT EMBEDDING      (LRE)
 U+202B RIGHT-TO-LEFT EMBEDDING      (RLE)
 U+202C POP DIRECTIONAL FORMATTING   (PDF)
 U+202D LEFT-TO-RIGHT OVERRIDE       (LRO)
 U+202E RIGHT-TO-LEFT OVERRIDE       (RLO)
 U+2066 LEFT-TO-RIGHT ISOLATE        (LRI)
 U+2067 RIGHT-TO-LEFT ISOLATE        (RLI)
 U+2068 FIRST STRONG ISOLATE         (FSI)
 U+2069 POP DIRECTIONAL ISOLATE      (PDI)

'''Embeddings versus isolates'''

Right-to-Left Override
* U+202E, the "Right-to-Left Override"
:: causes the following characters to be considered "strong" right-to-left regardless of what they are defined to be, until a PDF

https://languagelog.ldc.upenn.edu/nll/?p=4333
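A small Python sketch of the strongly/weakly typed part - unicodedata reports each character's bidi class, and the isolate characters above are ordinary codepoints you can insert yourself (how it finally renders is still up to the display layer):

 import unicodedata
 
 mixed = "abc \u05e9\u05dc\u05d5\u05dd 5!"    # Latin, a space, Hebrew, a digit, punctuation
 for ch in mixed:
     print(hex(ord(ch)), unicodedata.bidirectional(ch))
 # 'L' = strong left-to-right, 'R' = strong right-to-left,
 # 'WS'/'ON'/'EN' etc. are the weak/neutral classes the bidi algorithm has to resolve
 
 FSI, PDI = '\u2068', '\u2069'                # FIRST STRONG ISOLATE, POP DIRECTIONAL ISOLATE
 print("rating: " + FSI + "\u05e9\u05dc\u05d5\u05dd" + PDI + " - 5 stars")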




-->
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
http://unicode.org/reports/tr9/


An oversimplified history of text coding

🛈 I am ignoring a bunch of history that I am aware of, and probably more that I am not aware of

The point here is not completeness, but

to give you some context,
a "this is how we used to do it, and this is why we (mostly) stopped doing it that way",
...and then skip to a view for modern programmers, probably the only people who would really care much about these details to start with.

If you're interested in the details, there are better and longer histories, written by more knowledgeable people.




Codepage notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

General and/or western

  • ASCII
    • a.k.a. X3.4, IEC 646, International Reference Alphabet (IRA) (defined in ITU-T T.50), IA5 (International Alphabet No.5, the older name for IRA), ECMA-6
    • minor details in revisions (consider e.g. X3.4-1968, X3.4-1977, X3.4-1986)
    • Also sometimes called USASCII to distinguish it from more obscure things with ASCII in their name
    • Single-byte. Assigns 0x00 through 0x7F, leaves 0x80-0xFF undefined


  • Latin1, ISO 8859-1
    • one of about a dozen defined in ISO 8859
    • A codepage, in that it extends ASCII to cover about two dozen european languages, and almost covers perhaps a dozen more.
    • was one of the more common around the DOS era(verify)


  • Windows-1252, also codepage 1252, cp1252, sometimes WinLatin1
    • Default codepage for windows in western europe(verify)
    • a superset of Latin1 (which makes WinLatin1 a potentially confusing name)
    • Defines the 0x80-0x9F range (which in Latin1 was technically undefined but often used for C1 control characters) which it uses for some printable characters there instead, including €, ‹ and › (german quotes), „, Œ, œ, Ÿ, ž, Ž, š, Š, ƒ, ™, †, ‡, ˆ, ‰, and some others.
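A quick Python sketch of that last point - the 0x80-0x9F range is exactly where cp1252 and Latin1 disagree:

 raw = bytes([0x80, 0x9F])
 print(raw.decode('cp1252'))          # €Ÿ - printable characters in Windows-1252
 print(repr(raw.decode('latin-1')))   # '\x80\x9f' - the same bytes are (invisible) C1 controls in Latin1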

Mostly localized (to a country, or one of various used in a country)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Numbered codepages are often shorthanded with cp, e.g. cp437.

Defined by/for OSes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The list is much longer, but some better known names include:


  • Mac OS Roman
    • the most common Macintosh coding up to and excluding OSX (which adopted UTF-8)
    • Known to IANA under the MACINTOSH name, which can be a little misleading as there are several different Apple codings from about the same time.


  • Windows-1252
    • also known as WinLatin1 (which can be confusing, because it conflicts with Latin1, if only on one point)
    • a superset of the basic ISO/IEC 8859-1 (a.k.a. Latin1) set, though strictly conflicting, in that it uses characters in the C1 set (which are invalid in 8859-1).
  • Windows-1251 / cp1251
    • basically the cyrillic variant of Windows-1252
  • Windows-1250 / cp1250
    • basically the eastern european variant of cp1252
  • Windows-1253 / cp1253
    • basically the Greek variant of cp1252
  • Windows-1254 / cp1254
    • basically the Turkish variant of cp1252

Semi-sorted

  • The 7-bit character set defined in GSM 03.38 (...section 6.2.1) [1] (sometimes abbreviated 'GSM')



See also (codepages)

Unsorted:

Unicode

What unicode is

tl;dr: Unicode:

  • Unicode itself is a character set, a map of codepoints to characters.
intends to code characters, not glyphs (see also what unicode isn't, below)
  • ...plus a bunch of cases where codepoints combine to show fewer characters - see also grapheme clusters
  • Unicode covers the script of all known live languages, many dead ones, and a number of special purposes
(it's starting to get into semiotic questions - what sort of signs are understood even if they're not language, what do we do with fictional languages, etc.)
  • also, unicode defines a number of different byte codings to use in transmission
Perhaps unicode's most useful property is the most boring one: that its interpretation does not depend on context or serialization.



More practically

So practically it's a stable list of all the characters there are.

Unicode is probably the broadest character map, and a fairly thoughtful one at that (nothing's perfect - language isn't free of politics, for one, asian characters and Han unification have had hiccups, emoji seem to have a somewhat random bar for entry, and there are various other points of argument. That said, there is nothing else that comes even close to doing this job well that isn't also extremely local).

Unicode has in fact been used to define some new codepages, and to clarify older codepages, particularly their transforms to and from Unicode.



More technically

On inception, Unicode was considered 16-bit, and only later made to be open-ended in theory, though there is a current cap at U+10FFFF (this and some further rough edges actually come from the same history).

That cap of U+10FFFF allows 1.1 million codepoints, which is plenty - we're using only a small portion, and have covered most current and historical languages:

  • approx 110K are usable characters
    • about 70K of those 110K are CJK ideographs
    • approx 50K are in the BMP (Basic Multilingual Plane), 60K in higher planes (verify)
  • an additional ~130K are private use (no character/meaning assigned, and won't be; can be used for local convention -- generally not advised to use to communicate anything with meaning, though)
  • approx 900K within the currently-valid range are unassigned

...and that's after we've covered most languages we know of, so this is likely to last a long while.


The terminology can matter a lot (to the point I'm fairly sure some sentences here are incorrect) because once you get into all the world's languages' scripts, you have to deal with a lot of real-world details, from adding accents on other characters, to ligatures, to emoji, to right-to-left languages, to scripts where on-screen glyphs change depending on adjacent characters. (some of that is squarely the domain of fonts and not of characters, but Unicode design sometimes still has to be aware of that)

For example:

  • Characters - have meanings (e.g. latin letter a)
  • A glyph has a style.
    • Many characters can appear as fairly distinct-looking glyphs - consider the few ways you can write an a (e.g. cursive, block). Exactly the same function, but different looks.
  • also, some scripts will change the glyph used for a character in specific sequences of code points
  • consider languages that never separate specific characters (e.g. ch in Czech)
  • the concept of text segmentation for e.g. cursor movement - which is sometimes multiple codepoints - a 'grapheme cluster' becomes a more useful concept, see e.g. UAX #29.


To most of us, the takeaway from that is that codepoints tend to be semantic things - even if most of the time and for most of the world they work out as one glyph put after the other.


On top of the main standard, there are also

  • UAX (Unicode Standard Annex) are basically parts of the standard that happen to be described in separate documents
    • for example bidirectional behaviour is a subject useful to detail on its own
  • UTS (Unicode Technical Standards) are basically things that implementations can optionally conform to,
    • e.g. collation, some security considerations, some of the emoji processing, a compressed encoding
  • UTR (Unicode Technical Reports) are further informative material.
    • some of these are just helpful,
    • others are closer to UTSes - for example, Emoji (TR #51) started as a UTR and now is a UTS

more on their status list of them


What unicode isn't

U+ denotation

A particular Unicode codepoint is basically an integer, and typically denoted using the U+ format that uses hexadecimal, and frequently seen left-padded with zeroes.


For example, the character 32 (a space, taken from ASCII) would be U+20. Or U+0020.


Leading zeroes are ignored when parsed, so how many zeroes to pad with is mostly personal taste/convention. So U+05 is U+0000005 is U+0005 is U+5.
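In code, the U+ number and the character are trivially interchangeable - a quick Python sketch (Python used purely for illustration):

 ch = '\u00e9'                   # é, i.e. U+00E9
 print('U+%04X' % ord(ch))       # U+00E9
 print(chr(int('00E9', 16)))     # é - and back again
 print(0x20 == 0x0020)           # True - the zero padding really is just notation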


Code units, codepoints, scalar value

In general use (more so when working on the unserialized form of unicode, or thinking we are), we tend to just say 'codepoint', and intuitively mean what are technically scalar values.


In UTF-8 the serialized and unserialized forms are distinct enough that there's no confusion.


In UTF-16, confusion happens because

for approx 60K characters, code units code for points directly
And then the other ~50K codepoints they don't
But particularly English speakers deal primarily with that first set.


Code units aren't a uniquely UTF-16 concept, but are the most interesting in UTF-16.

Because UTF-16 had to become a variable-width serialization, it introduced code units that are not valid codepoints and are only used to indirectly code for other codepoints.

For example, the sequence U+D83E U+DDC0 is a surrogate pair - two UTF-16 code units that code for the single U+1F9C0 codepoint. You will never see U+D83E or U+DDC0 by itself in a UTF-16 string. Not in a well-formed one, anyway.


But nobody much uses that terminology that precisely, and we often use the same notation we generally associate with codepoints for UTF-16 code units as well, which causes confusion as it seems to blur the lines between encoded form and 'pure', scalar-value codepoints.

And you will run into this a lot in programming.
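A Python sketch of that exact example - serialize U+1F9C0 to UTF-16 and the two code units appear:

 ch = '\U0001F9C0'                   # 🧀, a codepoint above U+FFFF
 units = ch.encode('utf-16-be')
 print(units.hex())                  # d83eddc0 - the surrogate pair U+D83E, U+DDC0
 print(len(ch), len(units) // 2)     # 1 codepoint, 2 UTF-16 code units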



More technically

Relevant terms - see https://www.unicode.org/glossary/

  • Code Point[2] - Any value in the Unicode codespace; i.e. the range of integers from 0 to 0x10FFFF.
  • Code Unit[3] - "minimal bit combination that can represent a unit of encoded text for processing or interchange"
e.g. UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, UTF-32 uses 32-bit code units
  • Scalar value[4] - all codepoints except surrogates.



Storing, altering

So, say, you now know that U+2164 (Ⅴ) is roman numeral V, so how do you communicate it?


A lot of document formats opt for UTF8, or UTF16.

And are either defined to use that always, or mark what they use within the document.


Some cases have further options. For example most HTML and most XML allow three options: numeric entities like &#233; or &#xE9;, named entities like &eacute; (except it's a smallish list, and XML doesn't define those), and raw bytestrings according to the document encoding, e.g. that example would be 0xC3 0xA9 in UTF8
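To check that example concretely in Python (html.unescape handles both the numeric and the named entity forms):

 import html
 print(html.unescape('&#233;'), html.unescape('&#xE9;'), html.unescape('&eacute;'))   # é é é
 print('é'.encode('utf-8'))   # b'\xc3\xa9' - the raw-bytestring form in a UTF-8 document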


Dealing with unicode in programming is a more complex topic.

In-memory, UCS4 (just the codepoint as plain integers) is simplest for you to deal with, but there are practical reasons (storage space) and historical reasons (like Windows APIs initially being UCS2 so retrofitting it to UTF-16 was easier) some implementations actually opt for UTF-16 instead.

UTF-16 makes operations altering unicode strings more complex - even just counting codepoints means you have to decode the string. But it's still doable, and done. And you'd probably be using libraries anyway.


And most operating systems and most programming languages worth mentioning support Unicode at some level, relieving programmers from having to be obsessive-compulsive experts to get it right. Still, you have to know approximately what you're doing.




UCS

UCS (in itself, without a number) refers to the Universal Character Set - abstractly to the character set that Unicode defines.


In programming you more usually see references to UCS2 or UCS4, ways of storing non-encoded codepoints in fixed-size elements (16-bit and 32-bit uints). (and from the perspective of bytes, you are using one of UCS2BE, UCS2LE, UCS4LE, or UCS4BE)

Unicode library implementations often use an array of integers in the machine's native endianness, for speedy operations and simplicity, where each integer is either 2 bytes (UCS2) or 4 bytes (UCS4) - the latter supports all of unicode, the former only the BMP.

Various common operations are much simpler in this format than they are in encoded data (than in a UTF), including finding the number of characters, moving bidirectionally within a string, comparing two strings, taking a substring, overwriting characters, and such.

Yes, you could back an implementation with e.g. UTF16 instead, which is more memory-efficient, but the logic is a little slower, and much hairier. But it's doable, and done in practice.


UTF

UTF (Unicode Transformation Format) refers to various variable-length encoding of unicode.

The two most common flavours are UTF-8 and UTF-16 (Others exist, like UTF-32 and UTF-7, but these are of limited practical use).


UTF-8 and UTF-16 are both variable-length byte codings, which means only left-to-right movement in these strings is simple, and you technically can't seek to an arbitrary character without decoding everything before it (though for both there are ways around that if you can assume the string is well-formed(verify)).

Another implication is that various operations (see above) are more work, and often a little slower than unencoded form.


Note that UTF-16 is a superset of UCS2, and that the main addition of UTF-16 over UCS2 is surrogate pairs.

So it is backwards compatible -- but creates some footnotes in the process.

Windows before 2000 used UCS2, Windows 2000 and later use UTF-16 -- to support all characters and break (almost) no existing code. That is, interpreting UTF-16 as UCS2 won't fail, it just limits you to ≤U+FFFF.

Older software seeing surrogates would just discard them, or turn them into U+FFFD, for being non-allocated characters. The character won't display, but you can't really expect that from software from a time when these character ranges didn't exist. More important is that it doesn't fail.


Limits

UCS2 is limited to U+FFFF

UTF-16 is limited to U+10FFFF (there are only so many surrogate pairs)

UTF-8, UCS4, and UTF-32 can go up to at least 2^31 in theory, but currently everything keeps under the overall cap imposed via UTF-16:

while UTF-8 as an algorithm could code codepoint values up to 2^31, UTF-16's currently defined surrogates can only code 2^20 codepoints beyond the BMP.

Since UTF-16 is at the core of a bunch of Unicode implementations, and we haven't remotely filled the codepoints we have, that cap is unlikely to change any time soon.


Space taken

UTF-8 uses

  • 1 byte up to U+7F (which is mostly ASCII as-is, which is intentional)
  • 2 bytes for U+80 through U+7FF
  • 3 bytes for U+800 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF
  • It was designed to code higher codepoints, with 5-byte and 6-byte sequences, but this is currently outside the cap, and we are unlikely to extend that anytime soon, as we've not used more than ~30% of the space we have.
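
A quick way to see those byte counts from Python, with one example character per range:

for ch in 'A', 'é', '€', '🧀':        # U+0041, U+00E9, U+20AC, U+1F9C0
    print('U+%04X: %d bytes' % (ord(ch), len(ch.encode('utf-8'))))
# U+0041: 1 bytes,  U+00E9: 2 bytes,  U+20AC: 3 bytes,  U+1F9C0: 4 bytes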


Note that:

  • ASCII is used as-is in Unicode
particularly the printable part, so printable ASCII is itself valid UTF-8 bytes
the control characters (those below U+20) also map as-is; they're just not printable
  • ...also meaning English encoded in UTF-8 is readable enough by humans and other Unicode-ignorant systems, and could be edited in most text editors as long as they leave the non-ASCII bytes alone.
  • All codepoints from U+80 up are coded purely with non-ASCII byte values, and zero bytes do not occur in UTF-8 output (or in most ASCII text), meaning that
you can use null-terminated C-style byte strings to store UTF-8 strings
C library string functions won't trip over UTF-8, so you can get away with carrying strings that way
...just be aware that unicode-ignorant string altering may do bad things


UTF-16 uses

  • 2 bytes for U+0000 through U+FFFF
  • 4 bytes for U+10000 through U+10FFFF (by using pairs of surrogates to code above U+FFFF; surrogates as currently defined are also the main reason for that U+10FFFF cap)


UTF-32 stores all characters unencoded, making it superficially equivalent to UCS4. (It isn't quite: UCS is a storage-agnostic enumeration, UTF is a byte coding with unambiguous endianness.)


If you want to minimize stored size/transmitted bandwidth, you can choose between UTF-8 or UTF-16 based on the characters you'll mostly encode.

the tl;dr is that

  • UTF-8 is shortest when you store mostly western alphabetics
because characters under U+800 are coded with at most 2 bytes, and most western-ish alphabetic languages almost exclusively use codepoints under that - and mostly ASCII at that, coded in a single byte, so this will typically average less than 2 bytes per character.
  • UTF-16 is sometimes better for CJK and mixed content
UTF-16 uses exactly 2 bytes for everything up to U+FFFF, and 4 bytes for everything above that point.
so in theory UTF-16 will average somewhere between 2 and 4
in practice, the most common CJK characters are in the U+4E00..U+9FFF range, so you may find it's closer to 2 -- while in UTF-8 most of these take 3 or 4 bytes.
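
To get a feel for that trade-off, you can just compare encoded lengths on your own data; a rough sketch (the sample strings here are arbitrary):

samples = {'western': 'déjà vu, über café', 'cjk': '这是一个中文句子'}
for name, s in samples.items():
    print(name, len(s), 'codepoints,',
          len(s.encode('utf-8')), 'bytes UTF-8,',
          len(s.encode('utf-16-le')), 'bytes UTF-16')   # -le/-be avoids counting a BOM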

On surrogates

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Surrogates appear in UTF-16 serialized form, and are used to allow UTF-16 to represent the U+10000 to U+10FFFF range of codepoints.

Surrogates are values in the range 0xD800–0xDFFF (2048 values)


These values should appear only in UTF-16 serialized form.

They should not appear in UCS (i.e. those values are not valid codepoints) -- but the fact that some unicode implementations are UTF-16 rather than UCS4 can make that distinction somewhat vague and confusing in practice, so in the messy real world, this mixup does sometimes happen.


They should also not appear in other coded forms. For example, to UTF-8 they are just numbers it could code perfectly fine, but because of the 'shouldn't appear in UCS' rule, UTF-8 coders should give errors rather than pass them through - preferable behaviour, because double-encoding UTF codings is probably never a good idea.


Surrogate values are split into two ranges of 1024:

High surrogates are in the range of 0xD800 .. 0xDBFF
Low surrogates are in the range of 0xDC00 .. 0xDFFF


Surrogates are only valid in pairs, where the first must always come from the high range and the second always from the low range.

These each contribute 10 bits to an overall value in a relatively simple way, meaning surrogates can only code a 20-bit value.

UTF-16 surrogates are used to represent codepoints 0x10000 through 0x10FFFF. This seems to be most of the reason the unicode cap is at U+10FFFF (1114111 in decimal).
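
A minimal sketch of that arithmetic in Python (plain bit-twiddling, no library needed; the function names are just for illustration):

def to_surrogate_pair(codepoint):
    assert 0x10000 <= codepoint <= 0x10FFFF            # only these get a surrogate pair
    v = codepoint - 0x10000                            # now a 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)    # high: top 10 bits, low: bottom 10

def from_surrogate_pair(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

to_surrogate_pair(0x1F9C0)                 # (0xD83E, 0xDDC0)
hex(from_surrogate_pair(0xD83E, 0xDDC0))   # '0x1f9c0'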


The "pair is high then low" lets us detect data that doesn't seem to be UTF-16 at all, but more importantly, invalid use of surrogates.

And yes, that does happen. For example, Windows' APIs don't check for well-formed UTF-16, so will happily pass through ill-formed UTF-16 (e.g. from filesystem filenames) that somehow made it past some API that didn't check earlier.

(UTF-16 decoding should throw away lone surrogates that you see in encoded UTF-16, because that's ill-formed unicode. This is fairly simple code due to the way they're defined: every value in 0xDC00 .. 0xDFFF without a value in 0xD800 .. 0xDBFF before it should go away, every value in 0xD800 .. 0xDBFF without a value in 0xDC00 .. 0xDFFF after it should go away)


Notes:

  • The list of unicode blocks has high surrogates split into regular and "private use surrogates"
this seems to just reflect that the last 128 thousand of the ~million codepoints that UTF-16 surrogates can represent fall in planes 15 and 16, Private Use Area A and B


  • One annoying side effect is that any program which absolutely must be robust against invalid UTF-16 will probably want its own layer of checking on top of whatever the language provides.
In theory, you can ignore this because someone else has a bad implementation and them fixing it is best for everyone.
In practice, erroring out on these cases may not be acceptable.


  • You remember that "UTF-8 should not contain surrogates, lone or otherwise?" Yeah, some implementations do deviate
For example
Java uses a slight (but incompatible) variant of UTF-8 called "Modified UTF-8"[5]

[6], used for JNI and for object serialization(verify).

similarly, WTF-8 takes the view that it can be more practical to work around other people's poor API implementations, rather than go "technically they broke it, not my problem, lalalala"
it allows representation of invalid UTF-16 (from e.g. windows) so that it can later ask it for the same invalid string


More technical notes

Unicode General Category

Unicode states that each codepoint has a general category (a separate concept from bidirectional category, which is often less interesting)


The general class is capitalized, the detail is in an added lowercase letter.


Letters (L):

  • Lu: Uppercase
  • Ll: Lowercase
  • Lt: Titlecase
  • Lm: Modifier
  • Lo: Other

Numbers (N):

  • Nd: Decimal digit
  • Nl: Letter (e.g. roman numerals like Ⅷ)
  • No: Other (funky scripts, but also subscript numbers, bracketed, etc)

Symbols (S):

  • Sm: Math
  • Sc: Currency
  • Sk: Modifier
  • So: Other

Marks (M):

  • Mn: Nonspacing mark
  • Mc: Spacing Combining mark
  • Me: Enclosing mark

Punctuation (P):

  • Pc: Connector
  • Pd: Dash
  • Ps: Open
  • Pe: Close
  • Pi: Initial quote (may behave like Ps or Pe depending on usage)
  • Pf: Final quote (may behave like Ps or Pe depending on usage)
  • Po: Other

Separators (Z):

  • Zs: Space separator,
  • Zl: Line separator,
  • Zp: Paragraph separator,

Other (C):

  • Cc: Control
  • Cf: Format
  • Cs: Surrogate
  • Co: Private Use
  • Cn: Not Assigned (no characters in the unicode.org list explicitly have this category, but implementations are expected to return it for every codepoint not in that list)


Pay attention to unicode version, since there are continuous additions and corrections. For example, older versions may tell you all non-BMP characters are Cn. I believe Pi and Pf were added later, so beware of testing for too few lowercase variations.
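
For example, Python's unicodedata module answers these according to whatever Unicode version it was built against:

import unicodedata

unicodedata.unidata_version        # which Unicode version this Python ships with
for ch in 'Aa5Ⅷ€é ':
    print(ch, unicodedata.category(ch))
# A Lu,  a Ll,  5 Nd,  Ⅷ Nl,  € Sc,  é Ll,  (space) Zs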


On modifiers and marks: TODO

Normal forms

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Or, what's with this equivalence and decomposition stuff?


Unicode's normal forms are mainly meant to compare whether codepoints (and strings containing them) mean the same thing semantically, and/or seem the same thing visually, even when they are not the same codepoints.

To that end, it

  • defines two types of equivalence: canonical equivalence, and compatibility equivalence.
  • lets you decompose characters/strings to what it considers its component parts

As almost a side effect of the last, this is also useful to normalize strings in some specific ways, sometimes to remove diacritics, and other things -- but because this is not its actual aim, you should assume Unicode does neither of those things well.


Equivalence

Canonical equivalence is mainly used to test things that may show up in more than one way and are interchangeable.

Interchangeable in the sense of semantic equivalence: does it mean the same thing.

This usually also implies both will be rendered identically.


Consider é. It can show up in both of the following ways:

  • é as a single codepoint, U+00E9 (LATIN SMALL LETTER E WITH ACUTE)

OR

  • é which looks the same but is actually two codepoints: e (U+0065) followed by a combining acute accent (U+0301)

As far as canonical equivalence is concerned, these two are interchangeable.


Compatibility equivalence are for things you might want to reduce to the same thing, but are not interchangeable.

Consider the roman numeral Ⅳ as a single character, U+2163 (Ⅳ).

In some contexts it will be useful to be able to consider it the same as the basic two-character string "IV" - yet you would not want to generally consider the letters IV to be the roman numeral (e.g. SHIV is an all-caps word that does not mean SH5).



Note that since

canonically equivalent is "basically the same thing",
compatibility equivalence is more of a fuzzy 'does this look roughly the same',

...anything that is canonically equivalent is implicitly also compatibility equivalent.

And not the other way around - the implication only goes in that one direction.


Decomposition and composition (and tests using them)

Canonical decomposition and canonical composition will only separate and (re)combine things that are semantically equivalent.

For example, in python:

# canonical decomposition
unicodedata.normalize("NFD", "\u00e9")  == 'e\u0301'

# canonical composition 
unicodedata.normalize("NFC", 'e\u0301') == '\xe9'      


What they actually do is

  • NFD, a.k.a. Normalization Form D (and Normalization Form Canonical Decomposition)
will decompose only to canonically equivalent forms. Examples:
Roman numeral character Ⅳ stays Ⅳ
e-acute becomes separate e and combining acute
  • NFC, a.k.a. Normalization Form C (and Normalization Form Canonical Composition)
decomposes canonically, then composes canonically
Roman numeral character Ⅳ stays Ⅳ
separate e and combining acute becomes e-acute


When you want to

  • sort more sensibly
  • index text more sensibly
  • compare semantically and/or fuzzily without doing this transform each time

...then you might choose to run either composition or decomposition over all your text at input time.


There are further operations where normal forms are involved, like

  • removing certain accents (not really a designed feature, more of a 'you can often get decently far' - see the sketch at the end of this section)
  • halfwidth and fullwidth forms include a copy of the basic latin alphabet, mostly for lossless conversions between this and older encodings containing both halfwidth and fullwidth characters (verify)


The compatibility conversions aren't quite like the canonical variants. Consider

  • NFKD (Normalization Form Compatibility Decomposition) decomposes to compatibility-equivalent forms. Examples:
Roman numeral character Ⅳ becomes the two-letter string "IV" (unicodedata.normalize("NFKD", "\u2163") == 'IV')
e-acute becomes separate e and combining acute
  • NFKC (Normalization Form Compatibility Composition) decomposes to compatibility-equivalent forms, then recomposes canonically. Examples:
Roman numeral character Ⅳ becomes "IV", then stays "IV"
separate e and combining acute becomes e-acute



Note that there is no composition to compatibility forms - it would make way too many semantically nonsensical combinations. For example, it doesn't make sense for the letters IV to ever become Ⅳ again.
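
For example, the accent-removal trick mentioned above usually amounts to 'decompose, then drop the combining marks (category Mn)'. A sketch of that - a common idiom rather than an official Unicode feature, and it will not handle everything:

import unicodedata

def strip_marks(s):
    decomposed = unicodedata.normalize('NFKD', s)           # split into base + combining marks
    return ''.join(ch for ch in decomposed
                   if unicodedata.category(ch) != 'Mn')     # drop nonspacing marks

strip_marks('déjà vu')   # 'deja vu'
strip_marks('Ⅳ')         # 'IV'  (that's the compatibility mapping, not accent removal)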


Planes

Currently, the range open for plane and character definitions is capped at U+10FFFF, which the definition of UCS divides into seventeen planes of 65536 (0x10000) codepoints each (planes 0 through 16).

  • BMP (Plane 0, the Basic Multilingual Plane) stores scripts for most any live language. (most of U+0000 to U+FFFF). UCS2 was created to code just this plane.
  • SMP (Plane 1, the Supplementary Multilingual Plane) mostly used for historic scripts (ranges inside U+10000 to U+1DFFF)
  • SIP (Plane 2, the Supplementary Ideographic Plane) stores more Han-unified characters (ranges inside U+20000 to U+2FFFF)
  • SSP (Plane 14, the Supplementary Special-purpose Plane) contains a few nongraphical things, like language tag characters (a range in U+E0000 to U+EFFFF)
  • Planes 3 to 13 are currently undefined.
Plane 3 has tentatively been called the Tertiary Ideographic Plane, but (as of this writing) has only tentatively allocated characters
  • Plane 15 (U+F0000–U+FFFFF) and plane 16 (U+100000–U+10FFFF) are Private Use Areas A and B, which you can use for your own font design, non-transportable at least in the semantical sense.
(There is also a private use range in BMP, U+E000 to U+F8FF)

Byte Order Marker (BOM)

In storage you often deal with bytes.


UTF16 is 16-bit ints stored into bytes, so endianness becomes a thing.

It's useful to standardize any file serialization to a specific endianness, or to store its overall endianness, but you don't have to: you can optionally start each UTF-16 string with a Byte Order Marker (BOM).

The BOM is the character U+FEFF, which once serialized to bytes works out as either:

  • FE FF: means big-endian UTF-16 (UTF-16BE)
  • FF FE: means little-endian UTF-16 (UTF-16LE)


Notes:

  • U+FEFF was previously also used as a zero-width non-breaking space, so could also appear in the middle of a string.

To separate these functions, that second use is deprecated, and the zero-width non-breaking role is now covered by U+2060 (WORD JOINER).

  • U+FFFE is defined never to be a character, so that the above test always makes sense
  • Because BOMs are not required to be handled by all Unicode parsing, it is conventional that:
    • the BOM character is typically removed by UTF-16 readers
    • if a file/protocol is defined to be a specific endianness, it will generally not contain the BOM (and may be defined to never use it)



UTF-32 also knows BOMs: (though they are rarely used)(verify)

  • 00 00 FE FF: big-endian UTF-32
  • FF FE 00 00: little-endian UTF-32

(Note that if you don't know whether data is UTF-32 or UTF-16 (which is bad practice, really), the little-endian UTF-32 BOM is indistinguishable(verify) from a little-endian UTF-16 BOM followed by a NUL character, U+0000 - unlikely, but valid.)



UTF-8 is a byte coding, so there is only one order it can have in byte storage.

You do occasionally find the BOM in UTF-8 (in encoded form, as EF BB BF). This has no bearing on byte order since UTF-8 is a byte-level coding already. Its use in UTF-8 is mostly as a signature, also to make a clearer case for BOM detectors, though it is usually pointless, and accordingly rare in practice.
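
To make those byte patterns concrete (Python; note that the bare 'utf-16' codec writes a BOM itself, in the platform's byte order):

'\ufeff'.encode('utf-16-be').hex()   # 'feff'
'\ufeff'.encode('utf-16-le').hex()   # 'fffe'
'hi'.encode('utf-16').hex()          # 'fffe68006900' on a little-endian machine
'\ufeff'.encode('utf-8').hex()       # 'efbbbf'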


Unicode Text Segmentation

Grapheme clusters

Other Unicode encodings

Unicode was its own thing when introduced, and the web now largely uses UTF-8.


There are other encodings, mostly created after Unicode was introduced, that can also encode Unicode characters, not least of which is GB 18030, a Chinese standard which can encode all Unicode planes (verify).


If you consider a fuller list (and excluding UTF-8, UTF-16, arguably UTF-32, and GB18030), then most seem to serve a specific purpose, including:

Compressed formats:

  • SCSU [7] (Standard Compression Scheme for Unicode) is a coding that compresses Unicode, particularly if it only uses one or a few blocks.
  • BOCU-1 [8] (Binary Ordered Compression for Unicode) is derived from SCSU

These have limited applicability. They are not too useful for many web uses, since for content larger than a few kilobytes, generic compression algorithms compress better.

SCSU cannot be used in email messages because it allows characters that may not appear inside MIME. BOCU-1 could, but won't until clients widely decide to support it and they have little reason to these days.


Codings for legacy compatibility:

  • UTF-7 [9] does not use the highest bit and is safe for (ancient) 7-bit communication lines
  • UTF-EBCDIC [10]


(TODO: look at:)

More practical notes

Note: Some of this is very much my own summarization, and may not apply to the contexts, systems, pages, countries and such most relevant to you.


On private use

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


As far as unicode is concerned, these codepoints exist, but have no meaning.


In terms of semantics and accessibility, it's bad.

It may look like, say, Klingon to you, but not to a computer, unless both sides explicitly agree exactly which non-standard thing they are pasting on top.

If you don't pair it with a font, it won't even look like Klingon to you.

So blind people will not see this, and text-to-speech cannot pick it up.


Private use mostly acknowledges that people will want to do this at all - without providing for it, people could only add their own glyphs by replacing existing characters, and showing something that isn't the character it's defined to be is probably even more confusing (remember the mess that codepages and printer fonts made, and people being confused about Wingdings, which replaces standard ASCII values with its own thing) than using codepoints that have no meaning.


And without saying what you're adding (and there is no standard way to say it), the other side won't even know.



You should probably avoid it in general unless you have specific reason to.

You should certainly avoid it if it will be presented without the corresponding font.


Examples of use:

  • There are projects like CSUR (ConScript Unicode Registry[13]), an organized attempt to assign conlangs to private use ranges in a way that won't conflict - also meaning you can often mix them with only minimal font wizardry




On Asian characters

Beyond plain visible characters

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

There are a few dozen codepoints officially labeled <not a character>, which don't code anything. This e.g. includes

  • the last two codepoints in each plane
e.g. U+FFFE and U+FFFF on the BMP; keeping U+FFFE a non-character is part of what makes BOM detection work
  • U+FDD0..U+FDEF
apparently "to make additional codes available to programmers to use for internal processing purposes" [14]. You can think of this as a different type of private use
unrelated to the Arabic Presentation Forms-A range it is in, it just had space unlikely to be used


There are a lot of other things that are allocated codepoints, but are not / do not contribute to visible glyphs.

This is not unicode definition, just for my own overview.

  • Control characters (definitions from before Unicode)
    • 'C0 range', referring to the ASCII control characters U+00 to U+1F
    • 'C1 range', U+80-U+9F.
    • U+7F (DEL) is usually also considered a control character.
  • Surrogates (U+D800 to U+DFFF in two blocks)
in that seen in isolation, they code no character (and make the string ill-formed)
Their real purpose is to be able to code higher-than-BMP codepoints in UTF-16 (including various UTF16-based Unicode implementations)


  • Variations
    • Variation Selectors FE00 to FE0F - used to select standardized variants of some mathematical symbols, emoji, CJK ideographs, and 'Phags-pa [15]
    • Variation Selectors Supplement U+E0100 to U+E01EF [16]
See the Ideographic Variation Database[17][18]
short summary: Emoji uses VS16 (U+FE0F) (and sometimes VS15 U+FE0E to force non-emoji representation); non-CJK sometimes uses VS1 U+FE00; CJK uses VS1, VS2, VS3; VS4 through VS14 are not in use (as of Unicode 11)
  • Tags (U+E0000 to U+E007F)
originally intended to tag languages in text (see also RFC 2482); this was deprecated from Unicode 5 on
Unicode 8 and 9 cleared the deprecated status (for all but U+E0001) to allow some other use in the future
https://en.wikipedia.org/wiki/Tags_(Unicode_block)


  • Combining characters - visible, but need another character to go on


See also:


Mixing and messing encodings - mojibake

If you take something encoded into bytes (UTF-8, codepage, other) and assume/guess wrong about what it is or what it should be decoded as (or maybe encode it twice), then you get a garbled mess.


One of the wider names for such mistakes is mojibake.


And there's a bunch of possible combinations, so it varies whether you can detect the specific problem, and whether you can then fix it after the fact.


There are some more specific forms that are easier to recognize.

Bad encoding-decoding combinations: Â, Ã, etc.

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

When you take a text coding that was already encoded into bytes (charsets or UTF-8 or other), and encode that using UTF-8 (or something else), you get mangled data.


There are many variations on that theme, but the most common are probably

UTF-8 encoding Latin1/CP-1252 byte data, and
UTF-8 encoding UTF-8 byte data.


Nineties to noughties, the internet started storing and sending UTF-8, but web browsers would often still think Latin1/CP-1252 was a thing we wanted, and webservers might even tell us so explicitly (usually as a default/fallback, when you can't know).

This largely went away because browsers started defaulting to UTF-8 instead and, to a lesser degree, people started being better about specifying the encoding in use. Or the standard said so - e.g. HTML5 made a point of being UTF-8 by default.



Reasons, and recognizing these

  • bytes misread as codepoints will always sit within U+00 and U+FF
the non-ASCII UTF-8 bytes will be within 0x80 and 0xFF
while they can be anything within that range, in western practice the first byte of a sequence may stick to fewer values than that - consider...


  • U+0080 through U+07FF covers much of the additional characters seen in Europe and other Latin-related alphabets
those codepoints all become two-byte UTF8, and the first byte of which sits within 0xC2 through 0xDF.
If you misinterpret those as codepoints as-is, well, U+C2 through U+DF is Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
more specifically
the bulk of western Europe's accented letters sits within U+A0..U+FF (in Latin Supplement), which in terms of UTF8 first bytes is 0xC2 and 0xC3 (if you misinterpret as codepoints: Â Ã)
some less usual (and unusual) characters sit in Latin Extended-A and Latin Extended-B, U+100 through U+24F, UTF-8 first bytes 0xC4 through 0xC9; if you misinterpret: Ä Å Æ Ç È É
combining diacritical marks' first UTF-8 bytes are 0xCC, 0xCD (Ì Í)
Greek, Cyrillic, Hebrew and Arabic have first bytes roughly 0xCD through 0xDD; if you misinterpret: Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý


  • there are some specific outliers
e.g. word processors are known to replace the plain 'apostrophe characters around words' with ‘ (U+2018, LEFT SINGLE QUOTATION MARK) and ’ (U+2019, RIGHT SINGLE QUOTATION MARK),
which as UTF-8 bytes are e2 80 98 and e2 80 99
if these bytes are shown via CP-1252 (as they used to be), you will see â€™ (for the RIGHT SINGLE QUOTATION MARK)
if these bytes are shown as codepoints (U+E2 U+80 U+99), this is less visible, because while U+E2 is â, the last two are control codes, which won't have a printed character.


In a good number of cases you can un-mangle this if you manage to determine what happened. In some cases you can make a very good guess yourself (e.g. seeing U+E2 U+80 U+99 a bunch of times in otherwise basic western text), or use tooling that estimates it automatically from the data you give it.
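
A sketch of such un-mangling for the common 'UTF-8 bytes were decoded as CP-1252' case (Python; this assumes you guessed the right pair of encodings):

correct = 'it’s déjà vu'
mangled = correct.encode('utf-8').decode('cp1252')    # what the user ends up seeing
print(mangled)                                        # roughly: itâ€™s dÃ©jÃ  vu
fixed = mangled.encode('cp1252').decode('utf-8')      # reverse the mistake
fixed == correct                                      # True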


http://www.i18nqa.com/debug/utf8-debug.html

Emoji

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Types of emoticons / emoji (wider view than unicode)

Short history:

  • Emoticons originally mostly referred to combinations of ASCII, mostly simple smilies like :)


  • ...plus other characters we already had (unicode or not), most of them local ones, e.g.
Japanese kaomoji which started as ASCII like (*_*) (also seen in the west) but now have variants like (´。• ω •。`)
Korean style (like ㅇㅅㅇ),
down to patterns just seen in one place more than another, like ó_ò in Brazil, ಠ_ಠ initially on forums, etc.


  • some sites, mostly message boards, would replace specific text (like :)) with images.
Often mostly the simpler ASCII smilies, but some more extensive
  • The same boards are also known for adding further shortcodes, specific strings like :bacon: used as unique text markers to always replace with images
more flexible, sometimes board-specific - and for that same reason is less widespread


  • Later, a decent number of well used emoji were added to Unicode (most after 2010)
which would render as-is assuming you have a font, however...
various sites/devices do image replacement for these Unicode, to get color and their specific style.
to the point that this is part of their text editors


Wider support of emoji in Unicode made things more interesting (for a few different reasons).

Image replacement of Unicode was initially more often HTML-level replacement of a character with an img tag. These days the image replacement is often more integrated, where an app and/or phone input fields and/or browser(verify) shows the replacement as you type.

The same browsers/apps will also often provide decent emoji keyboards/palettes to make it much easier to enter them in the first place.

This kind of support is a fairly recent development (see e.g. http://caniemoji.com/), and getting pretty decent.

You can consider this more native support of emoji -- except also somewhat misleading. What gets sent is not the image you see but the unicode underlying it. And because there's now at least a dozen sets of images, it may look different in another app.

More on image replacement

Apps can choose to replace creative-use-of-character emoji, shortcodes, and/or unicode - with unicode and/or images.

We've somewhat shifted away from shortcodes, and from using custom image sets as a replacement (both work poorly between systems).


We've shifted to unicode emoji, probably in part because input palettes being largely unicode made those more reachable.


Apps may still use their own private set of image replacement, but this is mostly brand-specific aesthetics, of the same shared underlying characters.


Note that

  • the number of characters any of these replace varies.
  • apps may implicitly inherit this set from the platform they are on
particularly phone apps may get this
  • apps may explicitly adopt one of them
  • and yes, there have been some cases where the image aesthetics actually changed meaning, which led to miscommunication between people on different platforms
https://grouplens.org/blog/investigating-the-potential-for-miscommunication-using-emoji/


Emoji sets include:

Google [19] [20]
Microsoft [21]
Apple [22]
Samsung [23]
HTC [24]
LG [25]
WhatsApp [26]
Twitter [27], also used by e.g. Discord
Snapchat [28]
Facebook [29]
Mozilla [30]
EmojiOne [31] - licensable set used e.g. by Discourse, Slack
GitHub [32]
GMail - had a smaller set (used in subject lines, often square-looking)


Emoji according to unicode

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

A bunch of emoticon-like characters have been around for a very long time - if you count varied symbols like ☹ ☺ ☻, since Unicode 1.

Only later (Unicode 6 (~2010) and later) did we start (bulk) adding emoji in more dedicated ranges. By amount, these are the bulk of what we now consider emoji.

Data that standardizes emoji came later, with Emoji 1 (~2015), which started its own versioning (1, 2, 3, 4, 5, 11, 12, 13 - the jump comes from a choice to synchronize the Emoji version with the Unicode version).


Note also that Emoji makes a distinction between emoji representation and non-emoji representation. Any one character may have either or both, and more interestingly, some characters got an emoji representation well after their first introduction (e.g. ❤ got ❤️).


Emoji 1 was largely a roundup of all the emoji-like codepoints from the first seven(verify) versions of Unicode,

approx 1200 of them (counting flags)


Emoji 2 (came soon after 1) introduced sequences, and defined specific uses of (already-existing) skin tone, plus some other modifiers.

Versions since then have mostly just expanded on characters, sequences, and modifiers.

Emoji 3

Emoji 4

Emoji 5

Emoji 11

Emoji 12 and Emoji 12.1

Emoji 13 and Emoji 13.1

Emoji 14




Notes:

  • Note that brand new emoji often are not immediately known in browsers/devices, and font support generally lags much more.
  • If you're going to parse unicode's emoji data, you'll probably want to read [33].




See also:



Flags

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Flags do not have singular codepoints - this would be slightly awkward to keep up to date due to changing geopolitics.


When you restrict yourself to countries, things stay relatively simple. Sort of.

Unicode 6 added a range of letters used only as regional indicators (U+1F1E6 through U+1F1FF, within Enclosed Alphanumeric Supplement).

It comes down to "use these in pairs to refer to ISO 3166-1 alpha-2 code"[34].

(spoilers: except exceptions)


So, for example, 🇧🇪 is actually the sequence of

U+1F1E7 REGIONAL INDICATOR SYMBOL LETTER B
U+1F1EA REGIONAL INDICATOR SYMBOL LETTER E
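
A minimal sketch of building such a sequence yourself (Python; the names here are made up for illustration - the only trick is that the regional indicator letters sit at a fixed offset from 'A'):

REGIONAL_INDICATOR_A = 0x1F1E6

def flag_for(country_code):
    # map each ASCII letter to its regional indicator symbol
    return ''.join(chr(REGIONAL_INDICATOR_A + ord(c) - ord('A'))
                   for c in country_code.upper())

flag_for('BE')   # '🇧🇪' - whether that renders as a flag is up to the font/platform
flag_for('US')   # '🇺🇸'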

Doing it this way means

  • there's just one authoritative place to keep up to date.
  • changes in countries are up to implementations following changes to ISO 3166, not definitions within Unicode.
  • ...flags can actually show up differently in a lot of places.



One practical question is "where do you stop with regions?"

The above is an answer, but not a fully satisfying one.

Consider the UK.


This also runs right into politics.

A Microsoft or an Apple can, say, decide to include both the Israeli flag and the Palestinian flag. That might be a bit contentious depending on where you are at the moment, but you can defer to some type of neutrality this way.

But what to think about Apple removing the ability to render the Taiwanese flag (TW, 🇹🇼) only if the locale is mainland China? The reason clearly seems to be that the flag can symbolize Taiwanese independence, which the Chinese government is, ahem, not a fan of - but Apple still wants to sell iPhones to a billion Chinese people, soooo.... You now probably have opinions on the ethical stance that Apple is not taking there.

Windows seems to have specifically dropped flag support (browsers have their own implementation because windows does not). Which might well be related to such politics.


Unicode just side-stepped such issues in the first place by basically saying "we give you a system, you decide how to use it".


Unsorted

Right-to-left

https://www.w3.org/International/articles/inline-bidi-markup/uba-basics

http://unicode.org/reports/tr9/