Phonetic scripts

Note that

  • outside of linguistics, 'phonetic alphabet' often refers to a radiotelephonic spelling alphabet
  • ...but in linguistics, "phonetic script" means "a way to write distinct phonemes"


IPA

The International Phonetic Alphabet (IPA) allows standardized, detailed and international phonetic and phonemic notation, and is also rather useful in specifying alternative pronunciations.

For example /bɪˈkɔz/, /bɪˈkɒz/, and /bɪˈkʌz/ are alternative pronunciations of 'because.'


IPA 1990, IPA93

Text-coding-wise, SIL defined 'IPA93' (pre-Unicode) but this has been largely abandoned in favour of Unicode.

There was also a SIL IPA 1990

See e.g.:

https://scripts.sil.org/cms/scripts/page.php?id=encore-ipa&site_id=nrsi
https://scripts.sil.org/cms/scripts/page.php?id=ipahome&site_id=nrsi#836e214f
https://scripts.sil.org/cms/scripts/page.php?id=mappingfiles&site_id=nrsi#SILIPA1990


They are effectively byte codings that differ from ASCII and most codepages, and there was a font to go along with each.

You can find mappings from these to Unicode IPA, probably under:

https://github.com/silnrsi/wsresources/tree/master/scripts/Latn/legacy/

ASCII IPA

An unofficial encoding created to be able to communicate IPA characters over plain text, e.g. email.


Mapping IPA to ASCII characters necessarily makes it a multiple-character-length encoding, and it happens to be a variable-length one. It is more verbose, and/but also easier to parse (human and machine) than other similar attempts.

It allows some alternative ways of writing the same thing, which makes it a little harder to handle, but potentially more flexible. Because of the overlap in goals, the most succinct form of ASCII IPA looks like X-SAMPA.

Notes on Unicode-based IPA

Fonts

When writing phonetic data, it is very useful to use a font that actually contains (and so draws) all phonetic Unicode characters you reference.

  • Fonts with 'Unicode' in their name, such as 'Arial Unicode MS' and 'Lucida Sans Unicode,' have wide character coverage and may also serve you well, though they are not guaranteed to.

For installation instructions for Windows and Linux, see a page like http://helpful.knobs-dials.com/index.php/Font_installing_notes


This doesn't cover everything, and doesn't always agree with the next section (I'm not in the field anymore, so if some linguist wants to absorb all this detail into their own better set of notes, go right ahead.)

Character choice

There are some Unicode blocks specifically for phonetic characters, including:

  • IPA Extensions (U+0250 to U+02AF): covers most of regularly used IPA
  • Phonetic Extensions (U+1D00 to U+1D7F): not used by IPA; used by e.g. the Uralic Phonetic Alphabet, Old Irish phonetic notation, some dictionaries, and Americanist and Russianist phonetic notations[1].
  • Phonetic Extensions Supplement (U+1D80 to U+1DBF): a few specialized and some deprecated IPA characters(verify) [2]


There are also a handful that are not in there, that Unicode itself notes should be taken from existing characters.

This non-IPA-used-as-IPA set seems to include U+00E6 (æ), U+00E7 (ç), U+00F0 (ð), U+00F8 (ø), U+0127 (ħ), U+014B (ŋ), U+0153 (œ), U+03B2 (β), U+03B8 (θ), U+03BB (λ), U+03C7 (χ)

...so mostly from the Latin-1 Supplement, Latin Extended-A, and Greek and Coptic blocks.


...and arguably

  • 'global fall' and 'global rise' could be imitated using Arrows (U+2190 to U+21FF).
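
To see which blocks a given transcription actually draws from, something like the following can help. It is a minimal sketch: the block ranges are typed in by hand from the Unicode block list (Python's standard library does not expose block names), and the function names block_of and blocks_used are made up for this example.

```python
# Hand-entered block ranges (start, end, name); only the blocks relevant here.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1D00, 0x1D7F, "Phonetic Extensions"),
    (0x1D80, 0x1DBF, "Phonetic Extensions Supplement"),
]


def block_of(ch):
    """Return the name of the listed block containing ch, or 'other'."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"


def blocks_used(text):
    """Map each block name to the sorted distinct characters of text in it."""
    found = {}
    for ch in text:
        found.setdefault(block_of(ch), set()).add(ch)
    return {name: sorted(chars) for name, chars in found.items()}


if __name__ == "__main__":
    # b, U+026A (ɪ), U+02C8 (ˈ), k, U+0254 (ɔ), z
    for name, chars in blocks_used("b\u026a\u02c8k\u0254z").items():
        print(name, " ".join(f"U+{ord(c):04X}" for c in chars))
```

Anything reported as 'other' is simply outside the hand-entered list, not necessarily wrong.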


Notes:

  • (Because of c-with-cedilla (and apparently mostly/only it?), you should probably not decompose phonetic data.)
  • There are a few that look very similar, but are distinct.
Say, the voiced velar stop U+0261 (ɡ) is distinct from the letter g, U+0067 (g).
  • Note: U+25CC (dotted circle) is regularly used to have something to anchor a combining diacritic on for display, and to suggest the relative position it will sit at. (The unicode.org reference PDFs do this, for example)
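
These notes can be checked with Python's unicodedata module. A minimal sketch; the helper names splits_under_nfd and anchored are made up for this example, and it only tests canonical decomposition (NFD), not any other transformation your tools may apply.

```python
import unicodedata


def splits_under_nfd(text):
    """Return the characters of text that NFD splits into several code points."""
    return [ch for ch in text if len(unicodedata.normalize("NFD", ch)) > 1]


def anchored(mark):
    """Put a combining mark on U+25CC DOTTED CIRCLE so it can be displayed alone."""
    return "\u25cc" + mark


if __name__ == "__main__":
    # The non-IPA-block letters used as IPA: æ ç ð ø ħ ŋ œ β θ λ χ
    borrowed = "\u00e6\u00e7\u00f0\u00f8\u0127\u014b\u0153\u03b2\u03b8\u03bb\u03c7"
    for ch in splits_under_nfd(borrowed):
        parts = unicodedata.normalize("NFD", ch)
        print(f"U+{ord(ch):04X} ->", " ".join(f"U+{ord(p):04X}" for p in parts))

    # Script g (U+0261) and ASCII g (U+0067) stay distinct under normalization.
    print(unicodedata.name("\u0261"), "|", unicodedata.name("g"))

    # Combining tilde shown on a dotted circle.
    print(anchored("\u0303"))
```

Running splits_under_nfd over your own data is a quick way to see whether decomposition would touch anything besides c-with-cedilla.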


For some references, see:

There are various narrower definitions (say, the Uralic Phonetic Alphabet, which has been defined decently in Unicode) and many more local conventions (for example, various tone-marking conventions are mentioned in Priest's "Marking Tone").



You may care about this tool for translating from IPA and X-SAMPA, to IPA, X-SAMPA, and the LaTeX tipa package

Input methods

If you're transcribing a lot of sounds, you may care to find a method that lets you type what you hear, rather than search and click a lot.


Online IPA character pickers

JavaScript-based point-and-click sheets that build a string look like the least bother for small jobs. Some of them even have sound.

Character picker apps and plugins

IPA-specific pickers:

General Unicode character pickers:

Copy-pasting tends to be tedious in the long term, though you can relieve that a little by keeping common ones in a document.

Apps

  • yudit (Linux) is an X-based program which does text rewriting within its own GUI, meaning you can copy-paste from it, and don't have to understand input method trickery to use it.
Note it has an ASCII-IPA and a SAMPA method, but both seem incomplete. So unless you're already used to them, you probably shouldn't start using them, and should instead write a map that rewrites either your local transcription system or X-SAMPA to Unicode.

I have data to do the latter, mail me if interested.

Keyboard tricks

If you want to transcribe using just the keyboard, and learn to eventually do so quite fast, you can redefine keyboard behaviour, often allowing output of additional characters under specific key combinations and/or key sequences.

Note that when it comes to keyboard remaps, there are many non-general maps available. You can make your own; there are free keymap editors for each platform.


In Windows:

  • you can use Windows' Alt-number input method, but this accepts decimal numbers while Unicode writes codepoints in hexadecimal, so you'll want a side-by-side reference, like the one on this UCL page.
  • The 'Quick Unicode Input tool' is a slight improvement on this
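
The hexadecimal-to-decimal step is easy to script if you'd rather generate your own reference. A minimal sketch; the function name alt_decimal is made up here, and it only does the number conversion (whether a given application accepts the resulting Alt-number is not something it checks).

```python
def alt_decimal(codepoint):
    """Convert a codepoint written like 'U+0259' or '0259' to its decimal value."""
    digits = codepoint.strip().upper()
    if digits.startswith("U+"):
        digits = digits[2:]
    return int(digits, 16)


if __name__ == "__main__":
    # A few of the characters mentioned earlier on this page.
    for cp in ["U+00E6", "U+014B", "U+0259", "U+0261"]:
        print(cp, chr(int(cp[2:], 16)), alt_decimal(cp))
```

For example, alt_decimal("U+0259") gives 601.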

Keyboard maps: (limited to certain Windows versions)

  • keyboard layouts that re-map the usual keys to phonetic characters (like this)
  • Keyman-style rewriters [3].


Linux likes input methods. Many implementations are specific to Asian language input and are built on things tied somewhat to X or window managers:

  • UIM (Universal Input Method), which has a supporting toolbar/systray app, and includes an IPA definition
  • SCIM (Smart Common Input Method), which also has a systray app
  • the older XIM (X Input Method), which is often used as a (multi-)key rewriter (also used by Qt, KDE)
  • There is also a GTK+ IM, which is nice but cannot be used by non-GTK apps, and is therefore impractical as a general suggestion.
  • IIIMF, which is defined more generally (not tied to X, a window manager, language, or operating system) and so can be integrated fairly widely


You can also use the X KeyBoard Extension (XKB), for example to

  • change the mappings of existing keys (e.g. media keyboard buttons), or
  • define new key combinations (e.g. shift+ctrl+specific letter keys)

and map those to phonetic characters instead.


Mac OS X:


TODO: experiment, and write up how to actually work each of these.

SAMPA and X-SAMPA

SAMPA

The Speech Assessment Methods Phonetic Alphabet is an all-ASCII phonetic script based on the IPA that was a convenient way to communicate between varying systems and media.


Many distinct SAMPA charts exist, each specific to a language, and there are some redundancies and ambiguities between them (same character used for different sounds in different charts, and such).


X-SAMPA

The Extended SAMPA variant unifies the separate SAMPA charts, resolving ambiguities. It is generally preferable over SAMPA.

It covers almost all of IPA, and the mapping between the two is almost 1:1.


'Almost' because:

  • there are some X-SAMPA concepts for which there is no direct translation, often because the concept is not official (e.g. segmental notation)
  • there are some implicit choices/preferences:
    • sometimes there are two X-SAMPA codes for something, usually one coming from an old convention (e.g. the labiodental approximant)
    • in IPA there are two ways of showing tone level/contour, by diacritic and by character. Since these are equivalent, X-SAMPA does not code them separately
  • there are some details to do with composition:
    • Some X-SAMPA codes represent multiple IPA symbols. For example, IPA uses different rhoticity symbols for vowels and consonants, X-SAMPA does not.
    • Some character-diacritic combinations can be composed into a single character / decomposed into two. (It may even be that this step changes the combining character involved(verify))

These are relatively few and uncommon cases. They do make automatic and lossless translation a little harder, though.
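
For the common, unproblematic part of the mapping, a translator can be little more than a lookup table applied longest-match-first (so that a two-character code like r\ wins over plain r). A minimal sketch under stated assumptions: the table below is a deliberately tiny selection of X-SAMPA codes, not a complete chart, the function name xsampa_to_ipa is made up here, and none of the awkward cases listed above are handled.

```python
# Small selection of X-SAMPA codes and their IPA characters (not a full chart).
XSAMPA_TO_IPA = {
    "@": "\u0259",    # ə
    "@\\": "\u0258",  # ɘ
    "{": "\u00e6",    # æ
    "A": "\u0251",    # ɑ
    "E": "\u025b",    # ɛ
    "I": "\u026a",    # ɪ
    "O": "\u0254",    # ɔ
    "Q": "\u0252",    # ɒ
    "U": "\u028a",    # ʊ
    "V": "\u028c",    # ʌ
    "N": "\u014b",    # ŋ
    "S": "\u0283",    # ʃ
    "Z": "\u0292",    # ʒ
    "T": "\u03b8",    # θ
    "D": "\u00f0",    # ð
    "g": "\u0261",    # ɡ (script g, not ASCII g)
    "r\\": "\u0279",  # ɹ
    '"': "\u02c8",    # ˈ primary stress
    ":": "\u02d0",    # ː length
}
MAX_CODE_LEN = max(len(code) for code in XSAMPA_TO_IPA)


def xsampa_to_ipa(text):
    """Rewrite X-SAMPA to IPA, trying the longest code first at each position."""
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_CODE_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            # Not in the table: pass through (plain letters like b, k, z map to themselves).
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa('bI"kQz'))  # one of the 'because' pronunciations
```

Passing unknown characters through unchanged keeps the sketch short, but it also hides typos; a stricter version might raise on anything not in the table.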


Other (often specific-purpose)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Extended IPA (ExtIPA)

Meant to widen IPA's transcription abilities to include speech disorder effects, and includes secondary pronunciation details, prosodic details and such.

See Wikipedia: Extended IPA

CXS, Conlang X-SAMPA

A slightly modified (and incompatible!) form of X-SAMPA, consisting of a handful of changes and additions made by members of the Conlang language-construction mailing list.

It adds two unofficial vowels, encodes some diacritics differently, and encodes four existing vowels differently, but reuses characters in doing so. As such, it is ambiguous/incompatible with standard X-SAMPA. (I'm not sure how seriously to take this variation because of that detail).


Praat

Praat has its own trigraph approach to typing in phonemes (apparently stored as such, and shown via Unicode).


https://www.fon.hum.uva.nl/praat/manual/Phonetic_symbols.html



CMU

A simple, English-only coding used for the CMU Pronouncing Dictionary, which is a freely available word-pronunciation mapping.
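
Entries are plain text lines, which makes them easy to read in. A minimal sketch, assuming the commonly seen line layout of a word followed by space-separated phoneme codes, with a stress digit on vowels and a "(1)"-style suffix on the word for alternative pronunciations; the sample lines are illustrative rather than quoted from a specific release, and parse_entry is a made-up name.

```python
import re


def parse_entry(line):
    """Split one dictionary line into (word, [(phoneme, stress or None), ...])."""
    word, *codes = line.split()
    # Drop a trailing variant marker such as "(1)" from the word.
    word = re.sub(r"\(\d+\)$", "", word)
    phones = []
    for code in codes:
        if code[-1].isdigit():
            phones.append((code[:-1], int(code[-1])))  # vowel with stress digit
        else:
            phones.append((code, None))                # consonant, no stress
    return word, phones


if __name__ == "__main__":
    for line in ["BECAUSE  B IH0 K AO1 Z", "BECAUSE(1)  B IH0 K AH1 Z"]:
        print(parse_entry(line))
```

Comment and blank lines are not handled; filter those out before calling it.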

SAMPROSA

The SAM PROSodic Alphabet is made for prosodic transcription

See e.g. [4]


ARPAbet

ARPAbet was an early (1970s) ASCII-only coding, apparently for simple speech synthesis. It is not based on IPA and is not as fine-grained.


TIMIT

TIMIT was commissioned by DARPA for similar purposes, in the eighties.

There is also a TIMIT speech corpus originating in the early nineties. See e.g. [5]

WSJ

Apparently used in DARPA's Wall Street Journal corpus(verify).


SWB

Apparently used in the ICSI Switchboard Transcription System corpus(verify).


AHD

Probably referring to American Heritage Dictionary.(verify)


VoQS

Voice Quality Symbols


canIPA

http://venus.unive.it/canipa/


FLOSS

Fonemic Latin-One Spelling System (FLOSS), by Alan Beale, derived from MCM

http://www.wyrdplay.org/AlanBeale/FLOSS-ref.html

FLEWSY

Fonemic Latin-1 English Writing SYstem (FLEWSY), by Alan Beale, derived from MCM

See also FEWL.

MCM

Mixed Case Minglish (MCM), by Alan Beale

http://www.wyrdplay.org/AlanBeale/MCM-ref.html

In LaTeX

You probably want to use the tipa package.