Parallel texts
Parallel texts (a.k.a. bi-texts) are pieces of text alongside translations of the same text.
They are usually aligned to some degree: commonly per paragraph or per sentence, sometimes per word, as far as that works. Text alignment is in itself a relatively complex problem once you get below sentence level, because different languages structure their sentences differently, but that is often one of the things being studied.
In computational linguistics
The main thing done with bitexts is generating a word or segment alignment.
After such alignment, certain information can be extracted about equivalences, differences, term translations, etc.
The alignment problem can be framed in various ways. One of them is the generation of a bitext map: an injective (1-to-1) mapping in a space whose two dimensions are the character (or word) positions in each of the two texts. The resulting alignment data can be visualized as a series of segments, fixed at the positions where position equivalence is certain. Common such points include starts/ends of chapters, lists, paragraphs and, depending on how literal the translation is, sentences.
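As a rough illustration of the bitext map idea, the sketch below stores a map as a list of (source position, target position) anchor points and estimates positions between anchors by linear interpolation. The function names `is_valid_map` and `interpolate_target` and the anchor values are hypothetical, made up for this example. Linear interpolation is only a simplifying assumption here, not how any particular alignment tool works.

```python
from bisect import bisect_right

def is_valid_map(anchors):
    """Check that anchors increase strictly in both dimensions.

    Strictly increasing in both coordinates implies the map is
    injective (1-to-1) and never crosses itself.
    """
    return all(
        x0 < x1 and y0 < y1
        for (x0, y0), (x1, y1) in zip(anchors, anchors[1:])
    )

def interpolate_target(anchors, src_pos):
    """Estimate the target-text position matching src_pos.

    anchors: list of (src, tgt) positions where equivalence is certain,
    e.g. paragraph starts. Assumes is_valid_map(anchors) holds.
    """
    xs = [x for x, _ in anchors]
    if not xs[0] <= src_pos <= xs[-1]:
        raise ValueError("src_pos lies outside the anchored range")
    i = bisect_right(xs, src_pos) - 1      # segment containing src_pos
    if i == len(anchors) - 1:              # exactly on the last anchor
        return float(anchors[-1][1])
    (x0, y0), (x1, y1) = anchors[i], anchors[i + 1]
    return y0 + (y1 - y0) * (src_pos - x0) / (x1 - x0)

# Invented anchors: character offsets of paragraph starts in two texts.
anchors = [(0, 0), (100, 120), (250, 260)]
print(is_valid_map(anchors))             # True
print(interpolate_target(anchors, 50))   # 60.0
```

Between anchors the estimate is only as good as the assumption that both texts grow at a steady rate within the segment, which is why more literal translations allow denser anchors such as sentence boundaries.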
Features looked at to generate and verify alignments can include:
- word length
- word similarity measures (e.g. LCSR, the longest common subsequence ratio)
- word POS (often preserved in translations, particularly if fairly literal)
- cognates (when alphabets are similar)
- ...and various other things.
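To make one of the features above concrete, here is a minimal sketch of LCSR, taken here as the length of the longest common subsequence of two words divided by the length of the longer word. The function names `lcs_length` and `lcsr` are just names chosen for this example, and returning 0.0 for two empty strings is an arbitrary choice.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)          # DP row for the previous char of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(a, b):
    """Longest common subsequence ratio, between 0.0 and 1.0."""
    if not a and not b:
        return 0.0                     # arbitrary choice for empty input
    return lcs_length(a, b) / max(len(a), len(b))

# English/French cognate pair: all 10 letters of "government" appear,
# in order, within the 12 letters of "gouvernement".
print(lcsr("government", "gouvernement"))
```

A high ratio suggests that two words may be cognates, which is mostly useful when the two languages share an alphabet.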
Unsorted
Various sources of translations are useful as bitexts, including sites that provide movie subtitles, and more ready-made corpora such as EuroParl.
Smooth Injective Map Recognizer (SIMR)
Geometric Segment Alignment (GSA)
See also
- [Corpora]