Computational linguistics

Computational linguistics refers to the use of computers for language or text analysis.

Includes tasks like

Tokenization, segmentation, boundaries and splitting

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Tokenization refers to dividing free text into tokens, usually the smallest units you want to use for a particular analysis/computation.

For example, "This, that, and foo." might turn into the seven tokens: This  ,  that  ,  and  foo  .
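A minimal sketch of that kind of tokenization, assuming a token is either a run of word characters or a single punctuation character. The name `tokenize` and the pattern are illustrative only; real tokenizers make many more distinctions.

```python
import re

# A token is either a run of word characters, or a single character
# that is neither a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split free text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("This, that, and foo."))
```

Keeping punctuation as separate tokens (rather than dropping it) leaves the decision of what to ignore to later steps.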

You could also tokenize into sentences, lexical units, morphemes, or lexemes, though you start running into fuzzy boundaries, overlaps, and such.

Tokenization can be a useful organizational tool, to avoid your transforming tasks from interfering with each other. Consider that various analyses, tagging processes, transformations, groupings and such (POS tagging, stemming, inflection marking, phrase identification etc.) all want to work on the same token stream, but any changes from one task could interfere with every other task.

Tokenization can help when tokens are modelled as objects with attributes, but it can also get in the way when tokens are under-modelled for the tasks at hand, and it is quite easy to create token-based frameworks that are restrictive, prohibitively bothersome, or both. It is generally suggested that data structures be kept as simple as possible, though it is an open question whether that means simple for a given task or simple in general, which is something that language processing frameworks in particular have to deal with.


Taking words out of text is perhaps the most basic thing you may want to do.

In many writing systems, word boundaries are marked, most often by delimiting with a space and sometimes with things like word-final markers.

There are also scripts whose orthography does not include such marking. For example, reading Chinese involves a mental process of splitting out concept boundaries, which is sometimes fairly ambiguous without the semantics of the context. Japanese does not have spaces either (though writers can assist reading by alternating kanji and kana at concept boundaries, where possible).
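One common baseline for unspaced text is greedy longest-match segmentation against a word list. The sketch below is not something these notes prescribe; it is a swapped-in illustration, shown on unspaced English with a toy vocabulary (both are assumptions), and it makes the ambiguity problem visible: the greedy choice is not always the right one.

```python
def max_match(text, vocabulary):
    """Greedy left-to-right segmentation: always take the longest known word."""
    longest = max(map(len, vocabulary))
    tokens = []
    pos = 0
    while pos < len(text):
        for length in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # Fall back to a single character when nothing in the list matches.
            if candidate in vocabulary or length == 1:
                tokens.append(candidate)
                pos += length
                break
    return tokens

VOCABULARY = {"the", "table", "tab", "down", "own", "there", "here"}  # toy word list
print(max_match("thetabledownthere", VOCABULARY))
```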


Taking unstructured text or speech, and demarking where individual sentences are.

Often supports other tasks, e.g. for analysis to focus on structure within sentences. Sometimes as a more direct part of the point, as in text summarization.

Depending on a language's orthography and phonological structure, this task is not at all trivial when you want good accuracy on real-world text, even just on well-structured text. Utterances are much harder. In many languages, abbreviations also make things much harder.

Many initial and still-current implementations focus on mechanical, token-based rules, while in reality there are a lot of exception cases that can only be resolved based on content and context. Some can be dealt with with specific exceptions, but in the long term it is less messy and more accurate to use statistical, collocative analysis.


  • sometimes, terms like 'sentence boundary detection' refer more specifically to adding sentence punctuation to punctuation-less text such as telegrams (somewhat comparable to word boundary detection in text and sound, and line break detection in text).

In writing systems that use capitals and punctuation, the simplest approach to text sentence splitting is splitting on periods, question marks and exclamation marks, which may get expanded with conditions such as 'followed by a space and a capital letter', 'not preceded by what looks like a common title or honorific', and such. This is a method that will always be fragile to specific types of writing and certain types of text, so will never really yield a widely robust method. However, a day or two worth of rule development, ordering, and further tweaking can get you very decent results for corpus analysis needs.
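A rough sketch of the rule-based approach just described: split on . ! ? when followed by whitespace and a capital, unless the word before the break looks like a title. The `TITLES` set and the function name are hypothetical, and the list is deliberately incomplete.

```python
import re

TITLES = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}  # hypothetical, never complete

# Candidate boundary: ., ! or ? followed by whitespace and a capital letter.
BOUNDARY = re.compile(r"[.!?](?=\s+[A-Z])")

def split_sentences(text):
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        words = text[start:end].split()
        # Skip the candidate when the word before it is a known title.
        if words and words[-1] in TITLES:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! Was he?"))
```

It is fragile in exactly the ways noted above: an abbreviation such as p.m. followed by a capitalized name is split whether or not a sentence actually ends there.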

It has been suggested that more robustness can come from adding functional sentence analysis, such as looking for presence/absence of (main) verbs, and other simple models of what we ourselves do when listening to sentences. Humans too will usually keep recent sentence parts in mind and will go back to verify. For example, in "It was due Friday at 5 p.m. Saturday afternoon would be too late.", it is probably only at would that many people realize that a new sentence probably started at Saturday. Perhaps a good heuristic for whether a split is sensible is looking at the completeness of the resulting sentences, considering verbs and perhaps other things (that don't necessarily need parsing or tagging, at least in a simple version).

Note that there is an upper limit on accuracy anyway - there are enough cases where people will not come to an agreement exactly where the splits really are.

Potential complications, particularly to the dots-and-context approach:

  • a period not used as a full-stop, such as various types of
    • titles/honorifics: Dr., Mrs. Dres.
    • acronyms/initialisms with periods, for example There are no F.C.C. issues., are often not sentence boundaries -- but they can be, when they sit at the end of a sentence.
    • similarly marked abbreviations (possibly incorrectly), such as feat., vs., and such.
    • numbers, e.g. The 3.048 meter pole and perhaps You have $10.33., and even The 2. meter pole.
  • quotes:
    • He asked me "where are you headed?" What? There. "What? There!" "What?" "There!" "Wheeee" said I. (with possible complications in the form of spaces, and other minor variations)
    • apostrophes can easily mess up single-quote balancing rules. Consider e.g. 'tis not for thee.
    • fancy quotes, mixed quote use
  • stylistic text (use/abuse)
    • list items often have no period, but are usually (considered to be) sentences
    • headings that continue in the text
    • You're probably designing for English by now (a trainable model may be more practical)
  • unusual syntax, such as bad internet english, l33t, computer code, and others

Interesting things to think about - and possible rules:

  • quoted text (and parenthetical text) can be considered sub-sentences; you may want to model that.
  • sentences cannot start with certain punctuation (period, comma, percent sign, dashes, etc.)
  • Sentence separators (/ sub-sentence) usually happen at/near characters like " - --
  • A period followed by a digit is often not a sentence boundary
  • A list of titles/honorifics is basically never complete - and a pattern like [A-Z][a-zA-Z][.] won't always do either.
  • .!? split (sub)sentences, and may generally be followed by " or )
  • 'the next character is a capital' is decent verification for a break at the current position -- but not always; names, proper nouns and nonsense triggers for the verification can mess that up.
  • sentences usually don't end with titles/honorifics (but can)
  • sentences rarely end with initials (but can).
  • sentences sometimes end with abbreviations.
  • balanced parentheses and balanced double-quotes are a matter of style, and may differ (see e.g. conventions in book dialogue) -- you probably want to do basic text analysis to see whether this is a useful constraint or not.

Other methods, that try to be less content-naive, include:

  • looking for patterns (in words and/or their part of speech) to judge whether any particular sentence split is likely correct
  • using sets of rules, heuristics, or probabilities (often token-based)
  • interaction with named entity recognition / name recognition
  • Information from overall text, e.g. positions of paragraphs, indented lines, and such
  • interaction with POS tagging
  • ...and combinations.

Note that models that consider only direct context (e.g. simple regexps, token n-grams for small n, certain trained models) easily have an upper limit on accuracy, as the same direct context does not always mean identical sentence split validity.

Word boundary detection

Word boundary detection is useful when

Character boundary detection

Character boundary detection is relevant in Unicode, in that you can't split a string into multiple strings just anywhere. Consider pairs of a letter and a combining accent, Unicode characters that are drawn differently when combined, surrogate pairs (in UTF-16), and such.
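A small sketch of the letter-plus-combining-accent case, using Python's standard `unicodedata` module. It only keeps combining marks attached to their base character, which is a simplification; full grapheme cluster rules cover much more than this. The function name is illustrative.

```python
import unicodedata

def split_keeping_accents(text):
    """Split into code points, but keep combining marks with their base."""
    clusters = []
    for char in text:
        # unicodedata.combining() is nonzero for combining marks.
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters

word = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT + 'a'
print(len(word), len(split_keeping_accents(word)))
```

Here the string has three code points but only two user-visible characters, so cutting after the first code point would strand the accent.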


Spell checking

Stemming, lemmatization

Stemming and lemmatization reduce words to a normalized, uninflected form.

This is often a step in analyses that decide to simplify things with a single pass rather than continuous lookups. (or rougher things, like, inflection-independent type/token ratio statistics).

Know that various studies have concluded that for many languages' writing systems, stemming is not very interesting for search tasks. This is in part because stemming itself is way too simple - you mostly just change the behaviour rather than consistently improve it in a way a human would expect (which would probably require lemmatization, synsets, disambiguation, weighting, etc). Results of stemming vary considerably and unpredictably between queries. This varies with the amount of overstemming, which itself depends on what the stemming implementation canonicalizes into the same form - related forms (inflectional and/or derivational) and/or spelling variants, but mostly with whether it tries to get a minimal stem (more likely to collide with unrelated stems). Most problems are caused by derivational morphology as it is in itself an unpredictable transformation (in terms of meaning).

Stemming and lemmatization processes may differ between languages and between language families.


Stemming is a computational-linguistic sort of concept that tries to reduce words to inflectionless strings, often by chopping off what looks like common bound morphemes, and canonicalizing what's left to remove morphophonetic variation.

Stemming is often used as a mildly destructive data normalization step, and works best on more predictable words, where all inflections of the lemma are reduced to the same form. Various stemmers output something that is not the lemma (or even an existing word), but an existing word is not necessary for most uses.

Problems with stemming mostly stem from the fact that transformations like suffix chopping are a little too simple - they are ignorant of any minor morphology and subtleties.
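To make the 'too simple' point concrete, here is a deliberately naive suffix chopper. The suffix list and `min_stem` threshold are made up for illustration, and this is not any of the published algorithms listed further down.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, tried in this order

def naive_stem(word, min_stem=3):
    """Chop the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ("walking", "walked", "walks", "running", "news", "went"):
    print(word, "->", naive_stem(word))
```

The regular forms of walk collapse nicely, but running becomes the non-word runn, news collides with the unrelated new (overstemming), and the irregular went is not related to go at all.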

The simplicity of many algorithms often means they assume:

There is often a respectable set of words for which most of those are true. The set for which it isn't turns out to be crucial for human semantics, though.

It does not deal well with irregularity or polysemy, nor does it consider context or any sort of statistical prior, and even if it did, it would probably not deal that much better with polysemy. It doesn't deal so well with language change either - consider various words that started off as morphological alternatives but grew a more distinct sense.

Stemming can be implemented quite efficiently.

In some applications (e.g. some research), the canonicalization one wishes for is important enough to warrant using more computationally complex methods (like POS tagging, even lemmatization) to do stemming more accurately.


Lemmatization linguistically goes further than stemming, attempting to reduce each word to the lemma it is based on.

This is effectively a canonicalization, and can be useful to a larger number of linguistic tasks than stemming is, as various lemmatization algorithms try to be smart about multiple possibilities, context, and such (and can be noticeably slower than stemming).

For more accurate results:

  • you can make the lemmatisation process aware of the lexical category
  • you can deal with homographs e.g. by numbering senses, giving lists where multiple possibilities exist, or such
  • you can use context (e.g. to choose the more probable sense)

Accurate lemmatization is hard, and doing it near-perfectly can depend on context, semantics and arguably even pragmatics, so depending on your task, you often end up weighing stemming, simple lemmatization and detailed lemmatization.
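A toy sketch of the first two suggestions above: the lookup is keyed on the lexical category, and it returns a list where several lemmas are possible. The table contents and tag names are invented for illustration; a real lemmatizer would use a proper lexicon and morphological rules.

```python
# (word form, lexical category) -> possible lemmas; hypothetical entries.
LEMMAS = {
    ("saw", "VERB"): ["see", "saw"],      # past of 'see', or 'to saw'
    ("saw", "NOUN"): ["saw"],
    ("meeting", "VERB"): ["meet"],
    ("meeting", "NOUN"): ["meeting"],
    ("better", "ADJ"): ["good"],
}

def lemmatize(word, category):
    """Return candidate lemmas; fall back to the lowercased word itself."""
    key = (word.lower(), category)
    return LEMMAS.get(key, [word.lower()])

print(lemmatize("saw", "VERB"))
print(lemmatize("Meeting", "NOUN"))
```

Choosing between the candidates for saw as a verb is where context (the third suggestion) would come in.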

Algorithms and software

  • Dawson [1]
    • From 1974. Similar to Lovins; intended as a refinement.
  • PyStemmer[2]
    • Mostly a Python wrapper for Snowball's library.
    • Works on English, German, Norwegian, Italian, Dutch, Portuguese, French, Swedish. (verify)
  • Snowball[3]
    • mostly does suffix stripping.
    • Works on English, French, German, Dutch, Spanish, Portuguese, Italian, Swedish, Norwegian, Danish. (verify)
    • Under BSD license[4].
  • Krovetz [6]
    • From 1993(verify)
    • Accurate inflectional stemmer, but complicated and not very powerful.
    • R. Krovetz (1993) Viewing morphology as an inference process
  • Lovins
    • From 1968
    • See also [7]. [8]
    • See also JB Lovins (1968) Development of a stemming algorithm
  • Lancaster (Paice/Husk) [9]
    • From 1990
    • Fairly fast, and relatively simple. Aggressively removes endings (sometimes more than you would want).
    • See also [10]
  • RSLP stemmer (Removedor de Sufixos da Lingua Portuguesa)[12]

See also

  • Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting

Speech segmentation

The act of recognizing word boundaries in speech, and the artificial imitation to do the same in phonetic data.

Can also refer to finding syllables and phonemes.

Hyphenation, syllabization

Syllabization is about spoken sounds - syllables.

Hyphenation is about typesetting - graphemes. It follows readability -- which usually means it follows sounds and/or morphemes. It is also not very strictly defined.


Syllabification (syllabication, sometimes syllabization) breaks a word into groups of constituent sounds.

This isn't always a hard, clear-cut grouping. Ambisyllabicity refers to a sound being the coda of one and the onset of the next syllable. For example, the t in bottle is the onset of the second, but also the coda of the first syllable (it helps that this is a stop).

There are various patterns/observations (some are basically rules), including:

  • every syllable needs a vowel sound
    • amount of syllables == amount of vowel sounds (always?)
  • digraphs are not divided
  • consonant blends are not divided
  • compound words will split their parts
  • presence of x or ck: usually divided after
  • adjacent consonants: often split there
  • vcv: often split after the c
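As a quick check of the 'one syllable per vowel sound' observation, a sketch that counts groups of vowel letters. This works on spelling rather than sound, which is an assumption that fails regularly (silent e, for one), so treat it as a rough estimate only.

```python
import re

def count_vowel_groups(word):
    """Estimate the syllable count as the number of vowel-letter runs."""
    return len(re.findall(r"[aeiouy]+", word.lower()))

for word in ("banana", "bottle", "queue", "make"):
    print(word, count_vowel_groups(word))
```

It gets banana (3), bottle (2) and queue (1) right, but counts 2 for make, because the final e is written but not pronounced.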


Hyphenation (usually) refers to inserting a hyphen (with variation in typography) in a word, for one of many reasons, including...

Typographic hyphenation

One common use appears in typesetting, for layouting reasons: hyphens often break long words between two lines, to avoid having the line be wider than the rest of the paragraph or, in full justification, to avoid having to insert very wide spaces between all words.

Since this is primarily about readability (which, yes, is guided by morphemes and pronunciation), it turns out to be mostly about what to avoid, more about acceptability than (singular) correctness.

That leaves plenty of room for you to be lost in purposeless pedantry trying to create singular correctness.

There are many possible rules, plenty of exceptions to those, a good set of words with disagreements.

Such hyphenation is most important in good-looking fully-justified text, and less important in right-ragged text. For the latter, no hyphenation is better than bad hyphenation, though particularly long words cannot be escaped, so having some minimal rules is a good idea.


The decision of where to break a word in typesetting can vary. For example for English, US hyphenation is based on pronunciation so will generally follow syllable edges, while UK hyphenation plays by etymology/morphemes first, sound second. For example: the hyphenation of mechanism is mechan-ism in the UK, and mecha-nism in the US. That of progress is pro-gress in the UK, prog-ress in the US.

For both US and UK English, style guides will add some rules, for example avoiding breaks after just the first character (e.g. o-riginally), and avoiding cases where you may feed misleading expectations on what the rest of the word is, for example in coin-cidence or co-inage.

American English:

  • based on sound, both in that it follows stress and tends to follow syllables

British English:

  • based on morphology, then sound

Australian English:

Computational hyphenation

Computational hyphenation has existed for quite a while, for example in anything that needs to do text layouting.

Hyphenation can also be helpful to some computational linguistic tasks, such as recognizing similar words (....slightly more strictly than with no word analysis at all...) partly because break points will be pretty consistent e.g. between different inflections and derivations of the same word.

Hyphenating well is a complex task, because it may involve pronunciation, etymology, human-readability, and the combination of those is (in many languages) not easy to succinctly convert to a few rules. Such rules may in fact give conflicting suggestions.

It is not as hard to get decent results, though. For example, LaTeX typesetting guesses possible break points by character context, and puts different weights on each. It uses this very simple model to choose the most likely break point, and prefers easy breaks over more questionable breaks (in words and between words, as this is purely about typesetting).

Such hyphenation is rarely actually correct, in that it doesn't necessarily split on syllables, morphemes, or necessarily do clever things with unusual words such as loan words, nor will all possible break points necessarily turn up, but considering its simplicity, this system works quite well.
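A compact sketch of weighted character-context patterns, in the style of the pattern-based approach TeX uses (Liang's patterns): digits inside a pattern are weights on the gap at that position, the highest weight per gap wins, and odd weights allow a break. The handful of patterns below is an assumption chosen to cover one example word; real pattern files contain thousands. The `left_min`/`right_min` defaults are likewise illustrative.

```python
import re

def parse_pattern(pattern):
    """Turn a pattern such as 'hy3ph' into ('hyph', [0, 0, 3, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            weights[position] = int(char)
        else:
            position += 1
    return letters, weights

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Split a word at gaps whose highest pattern weight is odd."""
    padded = "." + word.lower() + "."      # dots mark the word edges
    scores = [0] * (len(padded) + 1)       # scores[i]: gap before padded[i]
    for letters, weights in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                index = start + offset
                scores[index] = max(scores[index], weight)
            start = padded.find(letters, start + 1)
    parts, last = [], 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:         # gap before word[k]
            parts.append(word[last:k])
            last = k
    parts.append(word[last:])
    return parts

PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io"]
print("-".join(hyphenate("hyphenation", PATTERNS)))
```

Even weights are what let a specific pattern veto a break that a more general pattern suggested, which is how exceptions get encoded without a separate exception list.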

In this context, a soft hyphen carries two meanings:

  • an invisible character that is a manual hint for automatic hyphenation (a sense largely used in word processors, and also web browsers)
  • a character automatically inserted as part of line breaking. That is, the visual result inserted by the hyphenation process.

Readability, convention

within words

Certain constructions, often morphological (e.g. affixes, agglutination), can introduce letter sequences that are harder to read.

In English, hyphenations are commonly seen to help readability. Cases include

  • places where double vowels would be introduced, such as in anti-inflammatory (rather than antiinflammatory)
  • cases where the original word was capitalized, such as anti-Semitic (rather than antiSemitic or antisemitic)

Rules like these exist in many languages, though the reasons and details vary - and may not be quite as strict as some people and books claim.


In English, compounds (of specific types) can (often because of delexicalization) variably appear with a space, a hyphen, or agglutinated (written together). A number of compounds are hyphenated because of their current state in that process.

There are quite a few constructions that are usually hyphenated (in an almost lexical(ized) way), such as expressions that act as compounds, and some uses of morphemes, such as 'self-' as a prefix.

Some of these things happen more in UK English than in US English.

For example, it is not unusual to see "good-bye" in UK English, where US English would write it as "goodbye". In this case, both English variants allow both forms, but have a preference. In other cases preferences may be stronger, and the other form is considered to look weird or incorrect.

The patterns that control these things are influenced by a language's tendency to agglutinate, and various other things.

Hyphens can mark cases where multiple words act as a phrase, but could be easily misunderstood or misread.

There are a few wide rules and tendencies, a bunch of specific-case rules, and quite a bit of real-world variation.


  • UK English does it more than US English
  • rarely done after -ly adverbs or comparatives
  • common when there is ambiguity in compound modifier / modified compounds
    • Ambiguity depends on case and context, and hyphenation is optional when one meaning is a lot more obvious, and the alternative meaning would either lead to other formulations or not occur to us at all. For example, "High school students", while technically ambiguous, is rarely understood as "school students high on drugs", so the hyphen in high-school is optional.
  • noun-verb combinations get this treatment (more than noun-noun)
    • seems much more likely when a modifier appears before the noun, much more rarely if it appears after
  • considered when a phrase leads to a garden path quality within a sentence, e.g. some uses of ill-advised, level-headed
  • older compounds, common compounds, idioms, and anything else collocative may be lexicalized enough to be less ambiguous today, so see less hyphen use. You would probably read "ill advised patient" as "ill-advised patient", which you wouldn't if 'ill-advised' wasn't a thing.

Lexical hyphens

Heteronyms in different lexical categories may have different hyphenation (and syllabization).

Consider 'invalid' as a noun (in-va-lid) and adjective (in-val-id), largely due to lexical stress.


Keep in mind that

Language detection

Identifying the language of a given piece of text can be handy for various language-specific computational tasks, including text classification, sentence splitting, search fuzzing, and more.

In some ways this is just another classification problem.

Your method may need to focus on one or more of, and might choose to use all of:

  • languages
  • writing systems (also e.g. romanizations and such), arguably some real-world resolution you can paste on top of mechanical parts accurately enough
  • text encodings (you can normalize this by detecting the text encoding and converting everything to Unicode)
    • note that the character set may be a strong indication of the stored language

...and this should probably be a choice, not side effect of your method.

A proper system probably can use all such information, but is not distracted by it. You could e.g. detect the text encoding and use it to give a mild boost to the languages that encoding is usually used for, and have your actual detection mechanism work on Unicode characters, so that it is based on actual human writing and not on computer encoding.

Lexical methods (recognizing words)

A very simple (e.g. bayesian-style) model where word matches are basically votes for respective languages.

Surprisingly accurate if your to-be-identified text is more than a few words, and you have relatively complete vocabularies for each of your target languages.

It may be useful to make this more involved, e.g. do word n-grams to model probabilities of words likely to follow others, though there is some chicken-and-egg issue in the lack of normalization.

The larger the text, the likelier that a bunch of words will match even limited lexical resources (though this can vary with how synthetic the language is), although it can be fragile with wordlist length and means of matching.

For robustness on smaller texts you want to have a large enough vocabulary for each language, and what constitutes 'enough' can vary with the language, for example with how much affixing and synthesis happens in that language.

Note that you may still want to use probabilities and/or have a lower limit on the amount of indicators required, because things like adopted words and jargon and such may throw off the guess to a historically related or an auxiliary language.
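The word-votes idea, including a lower limit on the number of indicators, could look roughly like this. The tiny vocabularies, the language codes and the `min_votes` default are all assumptions for the sake of the example; a usable system needs far larger word lists.

```python
import re
from collections import Counter

# Hypothetical, very incomplete vocabularies per language.
VOCABULARIES = {
    "en": {"the", "and", "is", "of", "to", "in", "that"},
    "nl": {"de", "het", "en", "is", "van", "een", "dat"},
    "de": {"der", "die", "das", "und", "ist", "von", "ein"},
}

def guess_language(text, min_votes=2):
    """Each known word is one vote; refuse to guess on too little evidence."""
    votes = Counter()
    for word in re.findall(r"\w+", text.lower()):
        for language, vocabulary in VOCABULARIES.items():
            if word in vocabulary:
                votes[language] += 1
    if not votes:
        return None
    language, count = votes.most_common(1)[0]
    return language if count >= min_votes else None

print(guess_language("het huis van de buurman is groot"))
```

Note that a word like is votes for two languages at once, which is one reason a single match should not be enough to decide.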

character n-gram

A mechanically simple-enough approach is a statistical n-gram setup.

There are a number of varying implementations of the n-gram idea, based on the observation that alphabet/character usage and language's morphology make for character sequences that point to particular languages.

It is also limited. Consider that related languages share a common origin, and are likely to share both roots and some morphology, particularly languages with a Latin or Romance background, and differences in spelling and morphology may not stand out very much from the things shared.

Even so, n-gram methods can work well enough when trained on just a few example documents per language and when asked to classify n-grams from half a dozen words or more.
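A bare-bones character trigram classifier, as a sketch. It compares trigram count profiles with cosine similarity, which is a stand-in chosen for brevity and not the rank-order comparison used in the n-gram categorization literature. The one-sentence training samples are obviously far too small for real use and are only there to make the example run.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams over lowercased text, with spaces as word edges."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build one n-gram count profile per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}

def cosine(a, b):
    dot = sum(a[gram] * b[gram] for gram in a if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(text, profiles, n=3):
    """Pick the language whose profile is most similar to the text's."""
    grams = Counter(char_ngrams(text, n))
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))

SAMPLES = {  # toy training data
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat",
}
profiles = train(SAMPLES)
print(classify("the dog", profiles), classify("de hond", profiles))
```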

See also

  • Beesley (1988) Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text [14]
  • WB Cavnar, JM Trenkle (1994) N-Gram-Based Text Categorization [15]
  • A Poutsma (2001) Applying Monte Carlo Techniques to Language Identification [16]

  • P Sibun, JC Reynar (1996) Language Identification: Examining the Issues [17]
  • G Grefenstette (1995) Comparing Two Language Identification Schemes [18]
  • Hughes et al. (2006) Reconsidering Language Identification for Written Language Resources [19]
  • EM Gold (1967) Language identification in the limit [20]


Corpus linguistics

Any study of language through (usually large) samples. Now very usually refers to computer-based analysis.

Examples for unannotated corpora are finding collocations or more semantically related words. For annotated corpora it includes n-gram and tree structure analysis.

Distributional similarity

The observation is that words that occur in the same contexts tend to have similar meanings (Harris, 1954) -- "A word is characterized by the company it keeps" (Firth, 1957).
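One simple way to make that observation operational is to count each word's neighbours and compare those context counts. The window size, the cosine measure and the four toy sentences are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=1):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            low = max(0, i - window)
            high = min(len(tokens), i + window + 1)
            for j in range(low, high):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

SENTENCES = [  # toy corpus
    "the cat drinks milk",
    "the dog drinks water",
    "the cat eats fish",
    "the dog eats meat",
]
vectors = context_vectors(SENTENCES)
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```

In this corpus cat and dog keep exactly the same company, so they come out as far more similar to each other than cat is to milk.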


A collocation refers to words that occur together more often than basic probability would suggest (often based on word-level statistics).

Collocation analysis studies the adjacency of words. This often easily turns up things like

  • preferences between words that (lexically speaking) might be interchangeable, some of which much stronger than others (consider the alternatives not chosen. Not necessarily wrong, but certainly marked)
    • verb-preposition pairs, e.g. agree with
    • adjective-noun, e.g. maiden voyage, excruciating pain, spare time
    • verb-noun, e.g. commit suicide, make a bed
    • verb-adverb, e.g. prepare feverishly, wave frantically
    • adverb-adjective, e.g. downright amazed, fully aware
    • noun combinations: a surge of anger, a bar of soap, a ceasefire agreement
    • ...and more
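'More often than basic probability would suggest' can be scored in several ways; pointwise mutual information (PMI) over adjacent word pairs is one common choice, sketched here. The `min_count` cutoff and the toy token list are assumptions, and PMI is known to overrate rare pairs, hence the cutoff.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """PMI (base 2) for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        # Observed pair probability versus what independence would predict.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores

TOKENS = ("maiden voyage of the ship the maiden voyage was "
          "the last voyage of the ship").split()
for pair, score in sorted(pmi_scores(TOKENS).items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```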

See e.g. for examples


Annotated data is much easier to learn from, while learning from non-annotated corpora avoids bias to annotation errors.

Manually annotated corpora are typically better (up to the point where human disagreement happens), but are costly in terms of time and/or money to create. Semi-automatic methods are often used but only help to some degree.

Automating accurate annotation is a useful thing, though it is arguably more an application than a field in itself.

Lexical annotation

Probably the most common: Adds lexical category to words, according to a particular tagset, and this implies knowing the lemma (...or having such knowledge in the trained data that e.g. statistical parsers are based on).

Phonetic annotation

Records how words were pronounced, at least phonemically, sometimes with more detailed phonetic information and possibly prosody (intonation, stress).

Semantic annotation

Disambiguates word senses. Relatively rare as it is costly to do manually (as any annotation is) and hard to do automatically.

Pragmatic and Discourse/Conversation annotation

(Sociological) discourse/conversation analysis is often served by systems of informed transcription/annotation, with extra annotation by someone who was present or who views video, since discourse sees a lot of incomplete, ungrammatical sentences that may depend on physical context, pauses, gestures and more - which also crosses into pragmatic annotation.

There is also analysis that tries to extract e.g. reference across sentences (to resolve cross-sentence ellipsis). Systems such as Discourse Representation Theory (DRT) are formal analysis/specification systems.

Knowledge analysis

Often shallow (as in mechanical rather than understanding) but good enough for many uses.

Named entity recognition


Given two text fragments, (textual) entailment tries to answer whether (most people would agree that) the meaning of one can be inferred (is entailed) from the other.

Quite usually, one of the two fragments is a short and well-structured hypothesis. To steal some examples:

Hypothesis: Sir Ian Blair works for the Metropolitan Police.
Fragment: Sir Ian Blair, the Metropolitan Police Commissioner, said, last night, that his officers were 'playing out of their socks', but admitted that they were 'racing against time' to track down the bombers.

Hypothesis: Yahoo acquired Overture (based on the question "Who acquired Overture" and the fragment)
Fragment: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year

Hypotheses might well be based on the fragment itself (e.g. in information extraction), or might be matched with a fragment (e.g. in question answering). Many of the papers on entailment argue for what does and what does not constitute entailment.

Matching is often relatively basic, with little consideration of modifiers, quantifiers, conjunctions, complex syntactical structures, complex semantical structures, or pragmatics.
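To show just how basic such matching can be, here is a word-overlap baseline: the fraction of the hypothesis's content words that also appear in the fragment. The stopword list and the idea of thresholding the score are assumptions; this is a sketch of a shallow baseline, not of any particular published system.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "by", "is", "was"}  # toy list

def content_words(text):
    """Lowercased word set with a few function words removed."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {word for word in words if word not in STOPWORDS}

def overlap_score(hypothesis, fragment):
    """Share of hypothesis content words that occur in the fragment."""
    hypothesis_words = content_words(hypothesis)
    if not hypothesis_words:
        return 0.0
    shared = hypothesis_words & content_words(fragment)
    return len(shared) / len(hypothesis_words)

fragment = ("Eyeing the huge market potential, currently led by Google, "
            "Yahoo took over search company Overture Services Inc last year")
print(overlap_score("Yahoo acquired Overture", fragment))
```

The score stops at two out of three because nothing here knows that took over paraphrases acquired, and the same score would be given to Overture acquired Yahoo, which is the kind of thing ignoring syntax and semantics costs you.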

Considered a fairly generic NLP task, potentially useful to various others. consider:

Examples of entailment:

Human-like tasks

Imitation of human tasks, which can be done well enough by computer for a lot of work.

Text summarization

Text summarization usually refers to the (automated) process of taking a text and producing anything from a subject to a naturally formatted summary.

Basic implementations do only sentence extraction, which takes the most interesting sentences from a larger set of sentences, based e.g. on some keyword-based weighing.
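A minimal sketch of keyword-weighted sentence extraction: score each sentence by the average corpus frequency of its non-stopword words and keep the top few in their original order. The stopword list, the naive sentence split and the averaging are all assumptions made to keep the example short.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in", "it"}  # toy list

def keywords(sentence):
    words = re.findall(r"[a-z']+", sentence.lower())
    return [word for word in words if word not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, in document order."""
    # Naive split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(word for s in sentences for word in keywords(s))

    def score(sentence):
        words = keywords(sentence)
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]

TEXT = ("Stemming reduces words. Stemming is fast and stemming is simple. "
        "The weather was nice.")
print(summarize(TEXT))
```

Averaging rather than summing avoids simply favouring the longest sentence, though either choice is a heuristic.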

More complex implementations may do deep parsing and/or generate new text based on semantic results of analysis.

Natural language generation

Natural language generation (NLG) can be taken as a subfield of Natural Language Processing, but is primarily concerned with generating linguistic expressions, often from knowledge bases and logic.

Question answering

Machine translation

Machine translation mostly concerns itself with transfer of lexical and semantic information between sentences in different languages.

Semantic and pragmatic transfer is ideal, but most current models are not advanced enough to deal with those (though they may of course do so in accidental easy/best cases).

Text alignment

Statistical Machine Translation