Localization, internationalization
This article/section is a stub: probably a pile of half-sorted notes, not well-checked, so it may have incorrect bits. (Feel free to ignore, or tell me)
In the context of computers:
- internationalization (/internationalisation): refers to designing software so that it can easily be used in various languages and locales.
- localization (/localisation): refers to the wish, the implementation, and the part of the OS environment that defines locale-specific handling of certain details, including the formatting of numbers, dates, and money amounts
- globalization: term used by IBM, referring to both
To save a lot of typing, these are often abbreviated to the first letter, the number of letters left out, and the last letter:
- internationalization is also known as i18n
- localization is also known as L10n
- similarly, accessibility is seen as a11y, and canonicalization as c14n
See also:
- http://en.wikipedia.org/wiki/Internationalization_and_localization
- http://en.wikipedia.org/wiki/Locale
Unsorted
Linux locale setting
To see the active locale settings, use:
locale
This should show something like the following. Values in double quotes were not set explicitly but are implied by LANG or LC_ALL:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
To see the locales that are installed, use:
locale -a
which typically shows anywhere between a few and a few dozen locales. I removed most locales from my system, so I only have a few:
C
en_GB
en_GB.iso88591
en_GB.utf8
en_US
en_US.iso88591
en_US.utf8
POSIX
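A script may want to check that list rather than eyeball it. The following is only a rough sketch: normalize_locale and have_locale are hypothetical helper names, not standard tools. The comparison is deliberately loose (lowercased, hyphens dropped), on the assumption that the main spelling difference between systems is of the utf8 versus UTF-8 kind.

```shell
# Hypothetical helper: normalize a locale name for loose comparison by
# lowercasing it and dropping hyphens, so en_US.UTF-8 and en_US.utf8 compare equal.
normalize_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-'
}

# Hypothetical helper: succeeds if some line of `locale -a` equals the
# requested name after the same normalization has been applied to both.
have_locale() {
    wanted=$(normalize_locale "$1")
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | tr -d '-' | grep -Fqx -- "$wanted"
}

# Example use: report whether a name is available before exporting it.
if have_locale "en_US.UTF-8"; then
    echo "en_US.UTF-8 (or an equivalent spelling) is installed"
else
    echo "en_US.UTF-8 is not installed"
fi
```

Even when this check succeeds, it is probably still safest to export the name exactly as locale -a spells it.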
Say I want to use en_US.utf8. (Note: copy-paste the name verbatim from the locale -a output. There are a few spelling differences between OSes, e.g. en_US.utf8 versus en_US.UTF-8.)
I would probably edit login settings (the best way to do this varies per system) to include something like:
export LC_ALL="en_US.utf8"
export LANG="en_US.utf8"
So what does the above actually do?
Programs and library functions that look at these variables (typically by calling setlocale()) will do formatting and parsing in a more localized way. Programs that never look at them are unaffected.
More central:
LC_TIME - names of weekdays and months, and the default ordering of date and time fields
- various libraries use only part of this (e.g. strftime lets you control the field order yourself, though it does pick up the names)
LC_CTYPE - character classification: which characters are letters, punctuation, upper or lower case; also transliteration, etc.
- in practice: text utilities may parse text slightly differently, e.g. in the way they look for word boundaries.
LC_MESSAGES - language to use for messages
LC_NUMERIC - number formatting, e.g. whether . or , is the decimal separator, and which character separates thousands
- in practice: a lot of things don't use this, so you'll probably see a mix. (verify)
LC_COLLATE - sorting order; in practice often more bother than it is useful
- it's less flexible than it needs to be for many real-world collation rules
- in some locales it changes what character ranges mean, so a regex like [A-Z] may also match lowercase letters
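To make the collation and range points concrete, here is a small sketch that forces the C locale for individual commands. It assumes the usual sort, grep and paste utilities; the variable names are just for illustration.

```shell
# Sort four lines bytewise by forcing the C locale for this one command.
# In ASCII, uppercase letters come before lowercase ones.
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | paste -s -d ' ' -)
printf '%s\n' "$sorted"      # A B a b

# In the C locale, [A-Z] covers only the ASCII uppercase letters,
# so just one of these two lines matches.
matches=$(printf 'a\nB\n' | LC_ALL=C grep -c '[A-Z]')
printf '%s\n' "$matches"     # 1
```

Running the same commands without LC_ALL=C, under for example en_US.utf8, will typically give a different sort order, which is the kind of inconsistency that makes scripts pin the locale.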
And relatively rarely used stuff:
LC_MONETARY - how to format money amounts
- few things use this
LC_PAPER - paper size (height and width in mm)
LC_NAME
LC_ADDRESS
LC_TELEPHONE
LC_MEASUREMENT - measurement system: 1 for metric, 2 for US
LC_IDENTIFICATION
Additionally there are LANG and LC_ALL. Neither needs to be set if the categories above are, because:
- LC_ALL forces the locale for all categories, overriding both LANG and the individual LC_ variables
- it seems meant to more easily force consistent behaviour in scripts
- LANG sets the default locale for all categories, and setting any individual category overrides it for that category
- so this is a convenience thing, and not directly picked up (verify)
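The precedence described above can be sketched as a small shell function. effective_locale is a hypothetical name, and the final fallback to C when nothing is set is an assumption (it matches what I believe glibc does; other systems may choose a different default).

```shell
# Hypothetical helper: print the locale that would apply to one category,
# checking LC_ALL first, then the category's own variable, then LANG.
effective_locale() {
    category=$1                        # e.g. LC_TIME
    eval "own=\${$category:-}"         # value of that category's variable, if set
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"        # LC_ALL overrides everything
    elif [ -n "$own" ]; then
        printf '%s\n' "$own"           # the category's own setting
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"          # the general default
    else
        printf '%s\n' "C"              # assumption: fall back to C when nothing is set
    fi
}

# Example use: show what currently applies to two categories.
effective_locale LC_TIME
effective_locale LC_NUMERIC
```

The locale command already reports the same information, so this is mainly useful for seeing the lookup order spelled out.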
Other notes:
- there is often a POSIX and a C locale.
- C is meant to be a simple, non-interpreting locale, useful to force collation to be bytewise, to have characters always be bytes (most other locales are UTF-8 these days), and to force the decimal separator to be . (e.g. useful for some shell arithmetic in scripts)
- POSIX is similar. In some cases it is an alias of C[1]; in other cases it is a definition that differs in a few details (apparently, for example, no explicit definition for sorting non-ASCII bytes) but is still effectively the same as C (verify)
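As an illustration of pinning the C locale in a script, the sketch below forces it for single commands. It assumes awk and wc are available; the variable names are made up for the example.

```shell
# Force the decimal separator to be . for one awk call,
# whatever LC_NUMERIC the surrounding session uses.
formatted=$(LC_ALL=C awk 'BEGIN { printf "%.2f\n", 3.14159 }')
printf '%s\n' "$formatted"   # 3.14

# In the C locale, characters are treated as single bytes, so a two-byte
# UTF-8 sequence (here the bytes for e-acute, written as octal escapes)
# should be counted as two characters rather than one.
count=$(printf '\303\251' | LC_ALL=C wc -m)
printf '%s\n' "$count"
```

Setting LC_ALL=C per command, rather than exporting it for the whole script, keeps the rest of the script's output in the user's own locale.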
https://sourceware.org/glibc/wiki/Locales#Locale_File_Format
https://unix.stackexchange.com/questions/87745/what-does-lc-all-c-do