Phonetic scripts
Note that
- in linguistics, "phonetic script" means "a way to write distinct phonemes"
- outside of linguistics, 'phonetic alphabet' often refers to a radiotelephonic spelling alphabet
IPA
The International Phonetic Alphabet (IPA) allows standardized, detailed and international phonetic and phonemic notation, and is also rather useful in specifying alternative pronunciations.
For example /bɪˈkɔz/, /bɪˈkɒz/, and /bɪˈkʌz/ are alternative pronunciations of 'because.'
IPA93
Text-coding-wise, SIL defined 'IPA93' (pre-Unicode) but this has been largely abandoned in favour of Unicode.
ASCII IPA
An unofficial encoding created to be able to communicate IPA characters over plain text, e.g. email.
Mapping IPA to ASCII characters necessarily makes it a multiple-character-length encoding, and it happens to be a variable-length one.
It is more verbose, but also easier to parse (for both humans and machines) than other similar attempts.
It allows some alternative ways of writing the same thing, which makes it a little harder to handle but potentially more flexible. Because of the overlap in goals, the most succinct form of ASCII IPA looks like X-SAMPA.
See also:
- http://www.kirshenbaum.net/IPA/ascii-ipa.pdf (basic definition/reference)
- http://www.kirshenbaum.net/IPA/
- http://www.kirshenbaum.net/IPA/faq.html
- http://www.kirshenbaum.net/IPA/english.html
Notes on Unicode-based IPA
Fonts
When writing phonetic data, it is very useful to use a font that actually contains (and so draws) all phonetic Unicode characters you reference.
- See also this list.
- Fonts with 'Unicode' in their name, such as 'Arial Unicode MS' and 'Lucida Sans Unicode,' have wide character coverage and may also serve you well, though they are not guaranteed to.
For installation instructions for Windows and Linux, see a page like http://helpful.knobs-dials.com/index.php/Font_installing_notes
This doesn't cover everything, and doesn't always agree with the next section
(I'm not in the field anymore, so if some linguist wants to absorb all this detail into their own better set of notes, go right ahead.)
Character choice
There are some Unicode blocks specifically for phonetic characters, including
- IPA extensions (U+0250 to U+02AF)
- covers most of regularly used IPA
- Phonetic Extensions (U+1D00 to U+1D7F)
- not used by IPA
- used by e.g. the Uralic Phonetic Alphabet, Old Irish phonetic notation, some dictionaries, and Americanist and Russianist phonetic notations[1].
- Phonetic Extensions Supplement (U+1D80 to U+1DBF)
There are also a handful that are not in there, that Unicode itself notes should be taken from existing characters.
This non-IPA-used-as-IPA set seems to include
U+00E6 (æ), U+00E7 (ç), U+00F0 (ð), U+00F8 (ø),
U+0127 (ħ), U+014B (ŋ), U+0153 (œ), U+03B2 (β), U+03B8 (θ), U+03BB (λ), U+03C7 (χ)
...so mostly from the Latin-1 Supplement, Latin Extended-A, and Greek and Coptic blocks.
...and arguably
- Latin Extended-B (U+0180 to U+024F), specifically the four click characters, U+01C0 (ǀ), U+01C1 (ǁ), U+01C2 (ǂ), U+01C3 (ǃ).
- Combining Diacritical Marks (U+0300 to U+036F) for some diacritics
- Spacing Modifier Letters (U+02B0 to U+02FF) for some more diacritics
- Superscripts and Subscripts (U+2070 to U+209F) to imitate nasal release (◌ⁿ) using U+207F (ⁿ)
- Modifier Tone Letters (U+A700 to U+A71F) for tone bars
- 'global fall' and 'global rise' could be imitated using Arrows (U+2190 to U+21FF).
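As a rough illustration of the block list above, the following Python sketch reports which of these blocks each character of a transcription falls in. The table only contains the ranges mentioned in this section, and the names `PHONETIC_BLOCKS`, `block_of` and `describe` are made up for this example. Characters borrowed from Latin-1, Latin Extended-A or Greek simply come back as None here.

```python
# Inclusive code point ranges, copied from the block list in this section.
PHONETIC_BLOCKS = [
    (0x0180, 0x024F, 'Latin Extended-B'),
    (0x0250, 0x02AF, 'IPA Extensions'),
    (0x02B0, 0x02FF, 'Spacing Modifier Letters'),
    (0x0300, 0x036F, 'Combining Diacritical Marks'),
    (0x1D00, 0x1D7F, 'Phonetic Extensions'),
    (0x1D80, 0x1DBF, 'Phonetic Extensions Supplement'),
    (0x2070, 0x209F, 'Superscripts and Subscripts'),
    (0x2190, 0x21FF, 'Arrows'),
    (0xA700, 0xA71F, 'Modifier Tone Letters'),
]


def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for low, high, name in PHONETIC_BLOCKS:
        if low <= cp <= high:
            return name
    return None


def describe(text):
    """Return (code point, character, block) for each character in text."""
    return [(f'U+{ord(ch):04X}', ch, block_of(ch)) for ch in text]


if __name__ == '__main__':
    # One of the 'because' transcriptions, written with escapes for clarity.
    for row in describe('b\u026a\u02c8k\u0254z'):
        print(row)
```

This can be handy for spotting characters that came from an unexpected block, for example after copy-pasting from a picker.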
Notes:
- (Because of c-with-cedilla (and apparently mostly/only it?), you should probably not decompose phonetic data.)
- There are a few that look very similar, but are distinct.
- Say, the voiced velar stop U+0261 (ɡ) is distinct from the letter g, U+0067 (g).
- Note: U+25CC (dotted circle) is regularly used to have something to anchor a combining diacritic on for display, and to suggest the relative position it will sit at. (The unicode.org reference PDFs do this, for example.)
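The notes above can be checked with Python's standard unicodedata module. This is a minimal sketch: the helper names `find_ascii_g` and `show_combining` are invented for the example, and the reading of why decomposition hurts (the decomposed form contains a plain c, which a naive search or mapping would treat as a different sound) is an assumption, not something the notes state.

```python
import unicodedata

C_CEDILLA = '\u00e7'  # c-with-cedilla, used in IPA for the voiceless palatal fricative

# NFD splits the precomposed character into 'c' + U+0327 COMBINING CEDILLA.
decomposed = unicodedata.normalize('NFD', C_CEDILLA)
# NFC puts it back together again.
recomposed = unicodedata.normalize('NFC', decomposed)


def find_ascii_g(ipa_text):
    """Return indices where the plain letter g (U+0067) appears.

    In IPA text these are probably meant to be U+0261.
    """
    return [i for i, ch in enumerate(ipa_text) if ch == 'g']


def show_combining(mark):
    """Anchor a combining mark on U+25CC (dotted circle) for display."""
    return '\u25cc' + mark


if __name__ == '__main__':
    print([f'U+{ord(ch):04X}' for ch in decomposed])
    print(find_ascii_g('\u0261ag'))
    print(show_combining('\u0303'))
```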
For some references, see:
- http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm (also mentions useful fonts)
- http://en.wikipedia.org/wiki/Unicode_Phonetic_Symbols
There are various narrower definitions (say, the Uralic Phonetic Alphabet, which has been defined decently in Unicode) and many more local conventions (for example, various tone-marking conventions are mentioned in Priest's "Marking Tone").
You may care about this tool for translating from IPA and X-SAMPA, to IPA, X-SAMPA, and the LaTeX tipa package
Input methods
If you're transcribing a lot of sounds, you may care to find a method that lets you type what you hear, rather than search and click a lot.
Online IPA character pickers
JavaScript-based point-and-click sheets that build a string look like the least bother for small jobs. Try:
- http://www.linguiste.org/phonetics/ipa/chart/keyboard/
- http://linguistlist.org/unicode/ipa.html
- http://people.w3.org/rishida/scripts/pickers/ipa/ (from W3C, one of many pickers)
...among others. Some of them even have sound.
Character picker apps and plugins
IPA-specific pickers:
- IPA Palette for Mac OS X
- IPA Charmap for Windows
General Unicode character pickers:
- Windows' charmap
- BabelMap (Windows)
- KDE's kcharselect, GNOME's gucharmap, and others
- Office:
- 'Insert Character'
- Uniqoder (Word 97, 2000)
- Office 2000 visual keyboard
- Office 2003 character toolbar
Copy-pasting tends to be tedious in the long run, though you can relieve that a little by keeping common characters in a document.
Apps
- yudit (Linux) is an X-based program that does text rewriting within its own GUI, meaning you can copy-paste the result and don't have to understand input method trickery to use it.
- Note it has an ASCII-IPA and a SAMPA method, but both seem incomplete, so unless you're already used to them, you probably shouldn't start using them. Instead, write a map that rewrites either your local transcription system or X-SAMPA to Unicode.
I have data to do the latter; mail me if interested.
Keyboard tricks
If you want to transcribe using just the keyboard, and learn to eventually do so quite fast, you can redefine keyboard behaviour, often allowing output of additional characters under specific key combinations and/or key sequences.
Note that when it comes to keyboard remaps, there are many non-general maps available. You can make your own; there are free keymap editors for each platform.
In Windows:
- you can use Windows' Alt-number input method, but this accepts decimal numbers while Unicode writes codepoints in hexadecimal, so you'll want a side-by-side reference, like the one on this UCL page.
- The 'Quick Unicode Input tool' is a slight improvement on this
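Since the Alt-number method wants decimal and Unicode references list hexadecimal, a small conversion helper can stand in for a side-by-side table. This is only a sketch: the function name `alt_number` is made up, and whether a given application accepts the resulting decimal value as a Unicode codepoint is an assumption you would need to check per program.

```python
def alt_number(codepoint):
    """Convert a codepoint written as 'U+0261' or '0261' to its decimal value."""
    cleaned = codepoint.strip().upper()
    if cleaned.startswith('U+'):
        cleaned = cleaned[2:]
    return int(cleaned, 16)


if __name__ == '__main__':
    # A few of the codepoints mentioned elsewhere in these notes.
    for cp in ('U+0261', 'U+014B', 'U+00E7'):
        print(cp, alt_number(cp))
```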
Keyboard maps: (limited to certain Windows versions)
- keyboard layouts that re-map the usual keys to phonetic characters (like this)
- Keyman-style rewriters [3].
Linux likes input methods. Many implementations are specific to Asian language input and are built on things tied somewhat to X or window managers:
- UIM (Universal Input Method), which has a supporting toolbar/systray app, and includes an IPA definition
- SCIM (Smart Common Input Method), which also has a systray app
- the older XIM (X Input Method), which is often used as a (multi-)key rewriter (also used by Qt, KDE)
- There is also a GTK+ IM, which is nice but cannot be used by non-GTK apps, and is therefore impractical as a general suggestion.
- IIIMF, defined more generally (not tied to X, a window manager, language, or operating system) so can be integrated fairly widely
You can also use the X KeyBoard Extension (XKB), for example
- change the mappings of existing keys (e.g. media keyboard buttons), or
- define new key combinations (e.g. shift+ctrl+specific letter keys)
and map those to phonetic characters instead
Mac OS X:
TODO: experiment, and write up how to actually work each of these.
SAMPA and X-SAMPA
SAMPA
The Speech Assessment Methods Phonetic Alphabet is an all-ASCII phonetic script based on the IPA, intended as a convenient way to communicate between varying systems and media.
Many distinct SAMPA charts exist, each specific to a language, and there are some redundancies and ambiguities between them (same character used for different sounds in different charts, and such).
X-SAMPA
The Extended SAMPA variant unifies the separate SAMPA charts, resolving ambiguities. It is generally preferable over SAMPA.
It deals with almost all of IPA and has an almost 1:1 mapping between IPA and it.
'Almost' because:
- there are some X-SAMPA concepts for which there is no direct translation, often because it is not official (e.g. segmental notation)
- there are some implicit choices/preferences:
- sometimes there are two X-SAMPA codes for something, usually one coming from an old convention (e.g. the labiodental approximant)
- in IPA there are two ways of showing level/contour, by diacritic and by character. Since these are equivalent, X-SAMPA does not code them separately
- there are some details to do with composition:
- Some X-SAMPA codes represent multiple IPA symbols. For example, IPA uses different rhoticity symbols for vowels and consonants; X-SAMPA does not.
- Some character-diacritic combinations can be composed into a single character / decomposed into two. (It may even be that this step changes the combining character involved (verify).)
These are relatively few and uncommon cases. They do make automatic and lossless translation a little harder, though.
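For the straightforward part of the mapping, a greedy longest-match rewriter is usually enough. The sketch below is not a complete or authoritative table: it covers only a small, illustrative subset of X-SAMPA codes, passes anything it does not know through unchanged, and ignores the edge cases listed above. The names `XSAMPA_TO_IPA` and `xsampa_to_ipa` are made up for this example.

```python
# Small illustrative subset of X-SAMPA codes and their IPA characters.
XSAMPA_TO_IPA = {
    'r\\': '\u0279',  # alveolar approximant
    '_h': '\u02b0',   # aspiration
    '"': '\u02c8',    # primary stress
    '%': '\u02cc',    # secondary stress
    ':': '\u02d0',    # length mark
    '@': '\u0259',    # schwa
    'E': '\u025b',    # open-mid front unrounded vowel
    'I': '\u026a',    # near-close front unrounded vowel
    'O': '\u0254',    # open-mid back rounded vowel
    'Q': '\u0252',    # open back rounded vowel
    'V': '\u028c',    # open-mid back unrounded vowel
}

MAX_CODE_LEN = max(len(code) for code in XSAMPA_TO_IPA)


def xsampa_to_ipa(text):
    """Rewrite X-SAMPA to IPA, trying the longest code first at each position."""
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_CODE_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            # Unknown here (e.g. plain lowercase letters): copy as-is.
            out.append(text[i])
            i += 1
    return ''.join(out)


if __name__ == '__main__':
    for word in ('bI"kOz', 'bI"kQz', 'bI"kVz'):
        print(word, xsampa_to_ipa(word))
```

Trying the longest code first matters because some codes are prefixes of others: r\ has to be tried before the plain r is copied through.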
Other (often specific-purpose)
Extended IPA (ExtIPA)
Meant to widen IPA's transcription abilities to include speech disorder effects, and includes secondary pronunciation details, prosodic details and such.
CXS, Conlang X-SAMPA
A slightly modified (and incompatible!) form of X-SAMPA: a handful of changes and additions made by members of the Conlang (language construction) mailing list.
It adds two unofficial vowels, encodes some diacritics differently, and encodes four existing vowels differently while reusing existing characters. As such, it is ambiguous/incompatible with standard X-SAMPA. (I'm not sure how seriously to take this variation because of that detail.)
Praat
Praat has its own trigraph approach to typing in phonemes (apparently stored as such, and shown via Unicode).
https://www.fon.hum.uva.nl/praat/manual/Phonetic_symbols.html
CMU
A simple, English-only coding used for the CMU pronouncing dictionary, which is a freely available word-pronunciation mapping.
SAMPROSA
The SAM PROSodic Alphabet is made for prosodic transcription
See e.g. [4]
ARPAbet
ARPAbet was an early (1970s) ASCII-only coding, apparently for simple speech synthesis. It is not based on IPA and is not as fine-grained.
TIMIT
TIMIT was commissioned by DARPA for similar purposes, in the eighties.
There is also a TIMIT speech corpus originating in the early nineties. See e.g. [5]
WSJ
Apparently used in DARPA's Wall Street Journal corpus(verify).
SWB
Apparently used in the ICSI Switchboard Transcription System corpus(verify).
AHD
Probably referring to American Heritage Dictionary.(verify)
VoQS
Voice Quality Symbols
canIPA
FLOSS
Fonemic Latin-One Spelling System (FLOSS), by Alan Beale, derived from MCM
http://www.wyrdplay.org/AlanBeale/FLOSS-ref.html
FLEWSY
Fonemic Latin-1 English Writing SYstem (FLEWSY), by Alan Beale, derived from MCM
See also FEWL.
MCM
Mixed Case Minglish (MCM), by Alan Beale
http://www.wyrdplay.org/AlanBeale/MCM-ref.html
In LaTeX
You probably want to use the tipa package.
Most useful for tipa character reference are:
- one or two of the smaller tables from the tipa documentation
- TIPA Cheat sheet (by Sven Grawunder)
- tipachart.pdf
See also
- Wikipedia: X-SAMPA.
- 'Computer-coding the IPA: a proposed extension of SAMPA'
- Images of the IPA chart with X-SAMPA alongside each phone; try a Google image search. Note there is also a conlang version, which you probably do not want.