Parallel texts

From Helpful
Revision as of 13:31, 14 July 2015 by Helpful
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Parallel texts (a.k.a. bi-texts) are pieces of text alongside translations of the same text.

They are often aligned to some degree. Text alignment is in itself a relatively complex problem once you get to sub-sentence level, because different languages structure their sentences differently.


In computational linguistics

The most common thing done with bitexts is generating a word or segment alignment.

After such alignment, certain information can be extracted about equivalences, differences, term translations, etc.
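
As a rough illustration of extracting term translations from aligned text, the sketch below counts which target words each source word is linked to. It assumes the alignment is already available as (source index, target index) links per sentence pair; the function names and the tiny English/German sample are made up for illustration and do not come from any particular toolkit.

```python
from collections import Counter, defaultdict


def translation_counts(aligned_pairs):
    """Count how often each source word is linked to each target word.

    aligned_pairs: iterable of (src_tokens, tgt_tokens, links), where
    links is a list of (src_index, tgt_index) tuples (assumed input format).
    """
    counts = defaultdict(Counter)
    for src_tokens, tgt_tokens, links in aligned_pairs:
        for i, j in links:
            counts[src_tokens[i]][tgt_tokens[j]] += 1
    return counts


def best_translation(counts, word):
    """Return the most frequently linked target word, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


# Hypothetical, hand-aligned sample data.
sample = [
    (["the", "house"], ["das", "Haus"], [(0, 0), (1, 1)]),
    (["the", "small", "house"], ["das", "kleine", "Haus"], [(0, 0), (1, 1), (2, 2)]),
    (["the", "dog"], ["der", "Hund"], [(0, 0), (1, 1)]),
]
counts = translation_counts(sample)
print(best_translation(counts, "house"))
```

Raw counts like these are only a starting point; real term extraction would usually normalise for word frequency and filter out rare links.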


The alignment problem can be framed in various ways. One is the generation of a bitext map (treated as injective, 1-to-1): the bitext is seen as a space with two dimensions of character (or word) positions, one per text, and the alignment is a set of points in that space. The resulting alignment data can be visualized as segments fixed at those positions where position equivalence is certain. Common such points include starts/ends of chapters, lists, paragraphs and, depending on how literal the translation is, sentences.
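
To make the bitext-map idea a little more concrete, here is a minimal sketch that treats the certain points as anchors and estimates positions in between by straight-line interpolation. The interpolation is an assumption made for illustration (it is not how any specific aligner is claimed to work), and the function name and anchor values are hypothetical.

```python
from bisect import bisect_right


def interpolate_position(anchors, src_pos):
    """Estimate the target position for src_pos from known anchor points.

    anchors: list of (source_position, target_position) pairs where the
    correspondence is considered certain (e.g. paragraph starts).
    Source positions are assumed to be distinct, matching a 1-to-1 map.
    """
    anchors = sorted(anchors)
    src_points = [s for s, _ in anchors]
    if not (src_points[0] <= src_pos <= src_points[-1]):
        raise ValueError("position lies outside the anchored range")
    # Index of the last anchor at or before src_pos.
    i = bisect_right(src_points, src_pos) - 1
    if i == len(anchors) - 1:
        return float(anchors[-1][1])
    (s0, t0), (s1, t1) = anchors[i], anchors[i + 1]
    # Linear interpolation within the segment between two anchors.
    return t0 + (src_pos - s0) * (t1 - t0) / (s1 - s0)


# Hypothetical anchors in character offsets: (source, target).
anchors = [(0, 0), (100, 120), (250, 260)]
print(interpolate_position(anchors, 50))
print(interpolate_position(anchors, 175))
```

The closer together the anchors are, the less the straight-line guess matters, which is one reason paragraph and sentence boundaries are attractive anchor points.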


Features looked at to generate and verify alignments can include:

  • word length
  • word similarity measures (e.g. LCSR, the longest common subsequence ratio)
  • word POS (part of speech; often preserved in translations, particularly if fairly literal)
  • cognates (when alphabets are similar)
  • ...and various other things.
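
As an example of one such feature, the sketch below computes LCSR, taken here as the length of the longest common subsequence of two words divided by the length of the longer word. Returning 0.0 for two empty strings is an arbitrary convention chosen for this sketch, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio, a value between 0 and 1."""
    if not a and not b:
        return 0.0  # convention assumed for this sketch
    return lcs_length(a, b) / max(len(a), len(b))


print(lcsr("colour", "color"))
print(lcsr("house", "tree"))
```

A high LCSR between a source word and a target word is a hint that they may be cognates, which is mostly useful when both languages use similar alphabets.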


Unsorted

Various sources of translations are useful as bitexts, including sites that provide movie subtitles, and more ready-made corpora such as EuroParl.


Smooth Injective Map Recognizer (SIMR)

Geometric Segment Alignment (GSA)

See also

  • [Corpora]