Parallel texts

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Parallel texts (a.k.a. bi-texts) are pieces of text alongside translations of the same text.

They are often aligned to some degree - often per paragraph or per sentence, sometimes per-word-as-far-as-that-works. Text alignment is in itself a relatively complex problem when you get to sub-sentence level, because of differing structures in different languages, but that's often one of the things being studied.


In computational linguistics

The major thing often done with bitexts is the generation of a word / segment alignment.

After such alignment, certain information can be extracted about equivalences, differences, term translations, etc.


The alignment problem can be seen in various ways. One of them is the generation of a bitext map (treated as injective, 1-to-1): the two texts are seen as two dimensions of character (or word) positions, and the resulting alignment data can be visualized as segments fixed at the positions where position equivalence is certain. Common such points include starts/ends of chapters, lists, paragraphs and, depending on how literal the translation is, sentences.
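As a rough illustration of the bitext-map view, the sketch below stores the certain points as (source position, target position) pairs and estimates positions in between by straight-line interpolation. This is only an assumption-laden toy, not SIMR or GSA: the function name `estimate_target_position`, the anchor values, and the choice of linear interpolation are all made up for the example.

```python
from bisect import bisect_right


def estimate_target_position(anchors, source_pos):
    """Estimate the target-text position for a source-text position.

    anchors: list of (source_pos, target_pos) pairs, sorted and strictly
    increasing in both coordinates (i.e. a monotonic, 1-to-1 map).
    Positions between two anchors are linearly interpolated, which is an
    assumption of this sketch rather than a property of real bitexts.
    """
    xs = [x for x, _ in anchors]
    if not xs[0] <= source_pos <= xs[-1]:
        raise ValueError("source_pos lies outside the anchored range")

    # Index of the last anchor at or before source_pos.
    i = bisect_right(xs, source_pos) - 1
    if i == len(anchors) - 1:
        return float(anchors[i][1])

    (x0, y0), (x1, y1) = anchors[i], anchors[i + 1]
    return y0 + (y1 - y0) * (source_pos - x0) / (x1 - x0)


# Hypothetical anchors, e.g. paragraph starts as character offsets.
anchors = [(0, 0), (100, 120), (250, 260)]
midpoint_estimate = estimate_target_position(anchors, 50)
```

The denser the set of certain points, the less the interpolation matters; with only chapter-level anchors the estimate can be far off.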


Features looked at to generate and verify alignments can include:

  • word length
  • word similarity measures (e.g. LCSR)
  • word POS (often preserved in translations, particularly if fairly literal)
  • cognates (when alphabets are similar)
  • ...and various other things.
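As an example of a word similarity measure, here is a minimal sketch of LCSR, taken here to be the longest common subsequence ratio: the length of the longest common subsequence of two words divided by the length of the longer word. The function names, and the choice to return 1.0 for two empty strings, are assumptions of this sketch.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only the previous row.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return lcs_length(a, b) / max(len(a), len(b))


# A likely cognate pair (English / French) scores high.
score = lcsr("government", "gouvernement")
```

A threshold on this score is one possible way to guess at cognates when the alphabets are similar; what threshold works would depend on the language pair.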


Unsorted

Various sources of translations are useful as bitexts, including sites that provide movie subtitles, and more ready-made corpora such as EuroParl.


Smooth Injective Map Recognizer (SIMR)

Geometric Segment Alignment (GSA)

See also

  • [Corpora]