Computational linguistics


Computational linguistics refers to the use of computers for language and text analysis.


Includes tasks like NLP (natural language processing), NLU (natural language understanding), and more.

The more mechanical side

NLP has grown very broad.


With a lot of statistical and machine learning techniques around, there is a distinction you can make between

  • more mechanical approaches, which might end up with complete annotations
  • more statistical tasks, which tend to end up with likelier-or-not estimates rather than definite annotations

Tokenization, segmentation, boundaries and splitting

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Tokenization refers to dividing free text into tokens - usually the smallest units you want to use for a particular analysis/computation.

For example, This, that, and foo. might turn into the token sequence This | , | that | , | and | foo | . - with punctuation split off into tokens of their own.

You could also tokenize into sentences, lexical units, morphemes, lexemes, or such, though you start running into fuzziness of boundaries, overlaps, and such.
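As a minimal sketch of the first sense (and not any particular library's tokenizer), a regular-expression tokenizer that splits punctuation off into separate tokens:

  import re

  # Words (letters, digits, internal apostrophes) and individual punctuation
  # characters become separate tokens. A sketch, not a robust tokenizer.
  TOKEN_RE = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")

  def tokenize(text):
      return TOKEN_RE.findall(text)

  print(tokenize("This, that, and foo."))
  # ['This', ',', 'that', ',', 'and', 'foo', '.']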


Tokenization can be a useful organizational tool, to keep transforming tasks from interfering with each other. Consider that various analyses, tagging processes, transformations, groupings and such (POS tagging, stemming, inflection marking, phrase identification, etc.) all want to work on the same token stream, but changes from one task could interfere with every other task.

Tokenization may help by having tokens be some realization of object/attribute, but it can also get in the way if under-modelled for the tasks at hand, and it is quite easy to create token-based frameworks that are restrictive, prohibitively bothersome, or both. It is generally suggested that data structures be kept as simple as possible, though there is some question whether that means simple for a given task or simple in general - something that language processing frameworks in particular have to deal with.


Words

Taking words out of text is perhaps the most basic thing you may want to do.

In many scripts, word boundaries are marked - many by delimiting with a space, some with things like word-final markers.

There are also scripts whose orthography does not include such marking. For example, reading Chinese involves a mental process of splitting out concept boundaries, which is sometimes fairly ambiguous without the semantics of the context. Japanese does not have spaces either (though writing can assist reading by alternating kanji and kana at concept boundaries, where possible).


Sentences

Taking unstructured text or speech, and demarcating where individual sentences are.

Often supports other tasks, e.g. letting analysis focus on structure within sentences. Sometimes it is a more direct part of the point, as in text summarization.


Depending on a language's orthography and phonological structure, this task is not at all trivial when you want good accuracy on real-world text, even just on well-structured text. Utterances are much harder. In many languages, abbreviations also make things much harder.

Many initial and still-current implementations focus on mechanical, token-based rules, while in reality there are a lot of exception cases that can only be resolved from content and context. Some can be dealt with via specific exceptions, but in the long term it is less messy and more accurate to use statistical, collocative analysis.


Notes:

  • sometimes, terms like 'sentence boundary detection' refer more specifically to adding sentence punctuation to punctuation-less text such as telegrams (somewhat comparable to word boundary detection in text and sound, and line break detection in text).


In writing systems that use capitals and punctuation, the simplest approach to sentence splitting is splitting on periods, question marks and exclamation marks, which may get expanded with conditions such as 'followed by a space and a capital letter', 'not preceded by what looks like a common title or honorific', and such. This method will always be fragile to specific types of writing and certain types of text, so it will never really yield a widely robust method. However, a day or two of rule development, ordering, and further tweaking can get you very decent results for corpus analysis needs.
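As an illustration of that dots-and-conditions approach, a minimal sketch in Python (the regular expression, the abbreviation list, and the checks are examples, nowhere near complete):

  import re

  # Candidate boundary: . ! or ?, optionally followed by a closing quote/bracket,
  # then whitespace, then something that looks like the start of a new sentence.
  CANDIDATE = re.compile(r'([.!?]["\)\]]*)\s+(?=["\(]?[A-Z0-9])')

  # A (necessarily incomplete) set of abbreviations that should not end a sentence.
  NON_FINAL = {"dr", "mr", "mrs", "ms", "prof", "vs", "etc", "feat", "e.g", "i.e"}

  def split_sentences(text):
      sentences, start = [], 0
      for m in CANDIDATE.finditer(text):
          # Look at the word just before the punctuation; skip the split if it
          # is a known abbreviation or a single initial such as "J."
          before = text[start:m.end(1)]
          last = (re.findall(r"[\w.]+", before) or [""])[-1].rstrip(".").lower()
          if last in NON_FINAL or len(last) <= 1:
              continue
          sentences.append(before.strip())
          start = m.end()
      sentences.append(text[start:].strip())
      return [s for s in sentences if s]

  print(split_sentences('He met Dr. Smith at 5 p.m. on Friday. "Was it late?" No.'))
  # ['He met Dr. Smith at 5 p.m. on Friday.', '"Was it late?"', 'No.']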

It has been suggested that more robustness can come from adding functional sentence analysis, such as looking for the presence/absence of (main) verbs, and other simple models of what we ourselves do when listening to sentences. Humans too will usually keep recent sentence parts in mind and go back to verify. For example, in "It was due Friday at 5 p.m. Saturday afternoon would be too late.", it is probably only at would that many people realize that a new sentence probably started at Saturday. Perhaps a good heuristic for whether a split is sensible is looking at the completeness of each candidate sentence, considering verbs and perhaps other things (which doesn't necessarily need parsing or tagging, at least in a simple version).


Note that there is an upper limit on accuracy anyway - there are enough cases where people will not come to an agreement exactly where the splits really are.



Potential complications, particularly to the dots-and-context approach:

  • a period not used as a full-stop, such as various types of
    • titles/honorifics: Dr., Mrs. Dres.
    • acronyms/initialisms with periods, for example There are no F.C.C. issues., are often not sentence boundaries -- but they can be, when they sit at the end of a sentence.
    • similarly marked abbreviations (possibly incorrectly), such as feat., vs., and such.
    • numbers, e.g. The 3.048 meter pole and perhaps You have $10.33., and even The 2. meter pole.
  • quotes:
    • He asked me "where are you headed?" What? There. "What? There!" "What?" "There!" "Wheeee" said I. (with possible complications in the form of spaces, and other minor variations)
    • apostrophes can easily mess up single-quote balancing rules. Consider e.g. 'tis not for thee.
    • fancy quotes, mixed quote use
  • stylistic text (use/abuse)
    • list items often have no period, but are usually (considered to be) sentences
    • headings that continue in the text
    • You're probably designing for English by now (a trainable model may be more practical)
  • unusual syntax, such as bad internet english, l33t, computer code, and others


Interesting things to think about - and possible rules:

  • quoted text (and parenthetical text) can be considered sub-sentences; you may want to model that.
  • sentences cannot start with certain punctuation (period, comma, percent sign, dashes, etc.)
  • Sentence (or sub-sentence) separators usually happen at/near characters like ", -, and --
  • A period followed by a digit is often not a sentence boundary
  • A set of titles/honorifics is basically never complete - and a pattern like [A-Z][a-zA-Z][.] won't always do either.
  • .!? split (sub)sentences, and may generally be followed by " or )
  • 'the next character is a capital' is decent verification for a break at the current position -- but not always; names, proper nouns and nonsense triggers for the verification can mess that up.
  • sentences usually don't end with titles/honorifics (but can)
  • sentences rarely end with initials (but can).
  • sentences sometimes end with abbreviations.
  • balanced parentheses and balanced double quotes are a matter of style, and conventions may differ (see e.g. dialogue conventions in books) -- you probably want to do basic text analysis to see whether this is a useful constraint or not.


Other methods, that try to be less content-naive, include:

  • looking for patterns (in words and/or their part of speech) to judge whether any particular sentence split is likely correct
  • using sets of rules, heuristics, or probabilities (often token-based)
  • interaction with named entity recognition / name recognition
  • Information from overall text, e.g. positions of paragraphs, indented lines, and such
  • interaction with POS tagging
  • ...and combinations.


Note that models that consider only direct context (e.g. simple regexps, token n-grams for small n, certain trained models) easily have an upper limit on accuracy, as the same direct context does not always mean identical sentence split validity.
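For example, NLTK ships a pre-trained model for its Punkt sentence tokenizer (an unsupervised, collocation-based approach), which already handles many abbreviation cases; this assumes NLTK and its punkt data are installed:

  import nltk

  # One-time download of the pre-trained model data: nltk.download("punkt")
  text = "He met Dr. Smith at 5 p.m. on Friday. It went well."
  print(nltk.sent_tokenize(text))
  # e.g. ['He met Dr. Smith at 5 p.m. on Friday.', 'It went well.']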


Unsorted related links:


See also

Word boundary detection

Word boundary detection is useful when a script's orthography does not mark word boundaries (see the notes on words above).

http://icu-project.org/docs/papers/text_boundary_analysis_in_java/

Character boundary detection

Character boundary detection is relevant in Unicode, in that you can't split a string into multiple strings just anywhere. Consider letter-plus-combining-accent pairs, Unicode characters that are drawn differently when combined, Unicode surrogates, and such.
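A quick illustration using only Python's standard library: with an accent written as a separate combining code point, naive slicing separates it from its base letter, while grouping code points by combining class keeps them together. (Full grapheme cluster rules - surrogates, emoji sequences, and so on - are better left to a dedicated library such as ICU.)

  import unicodedata

  # "é" written as a base letter plus a combining acute accent (two code points).
  s = "caf" + "e" + "\u0301"

  print(len(s))    # 5 code points, though it displays as 4 characters
  print(s[:4])     # 'cafe' - naive slicing strips the accent off its base letter

  # Group code points into (base + combining marks) clusters.
  clusters = []
  for ch in s:
      if unicodedata.combining(ch) and clusters:
          clusters[-1] += ch      # attach the combining mark to the previous base
      else:
          clusters.append(ch)

  print(clusters)  # ['c', 'a', 'f', 'é'] - the accented letter stays in one piece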



Chunking

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Where a parser is understood as giving a more complete derivation of all the structure in a sentence (which is hard to do with good precision), a chunker can be seen as a simpler alternative, considered shallow parsing in that it does not try to do everything.

It still does a lot more than nothing, and tries to be decent at the one job it chooses - which can include detecting noun groups, e.g. detecting phrases with a good guess as to what their head is, and possibly also verb groups or verb phrases. Maybe partitioning a sentence into NPs, prepositions, and 'other'.

Their output is often also simple - nothing recursive, nothing overlapping (see also IOB tagging). For certain tasks, that's enough.
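As an illustration of a classical chunker, NLTK's RegexpParser builds flat NP chunks from POS tags with a simple tag pattern (the grammar below is the usual textbook toy, not a serious NP chunker; assumes NLTK and its tokenizer/tagger data are installed):

  import nltk

  # One-time data: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
  sentence = "The quick brown fox jumped over the lazy dog."
  tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

  # NP chunk = optional determiner, any number of adjectives, then noun(s).
  # Flat and non-recursive - exactly the 'shallow' output described above.
  chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
  print(chunker.parse(tagged))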


Now that do-it-all parsers are more common, chunking in the 'do one select job decently' sense is less common; instead, doing a specific job based on a fuller parse (e.g. detecting noun phrases) may be more common - and maybe more precise than classical chunkers, because it gets to use more of the contextual information the parser found. You can argue whether that is still called chunking.

Chunking is sometimes summarized as parsing that provides a partial syntactic structure of a sentence, with a limited tree depth, as opposed to full-on parsing.


Named entity recognition

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Named-Entity Recognition (NER) considers Named Entities to be any concept for which we use fairly consistent names.

This could be considered two tasks in one: recognizing such a name, and classifying it as a particular type of entity.


And typically, the aim is to specifically avoid a precompiled list (a 'gazetteer'), and instead to generalize a little, so we can find things that fit a pattern even when we don't know the specific case.


There is no single definition of the kinds of things this includes, but it often focuses on

  • names of people,
  • names of organizations,
  • names of locations,

and also

  • medical codes and other consistently encoded things,
  • quantities
  • dates
  • times
  • money amounts

...probably in part because those are pretty predictable - though note that the latter are not really names, they're just useful things to tag.
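For a concrete example, spaCy's pre-trained pipelines include an NER component that does both steps (finding the span and typing it); this assumes spaCy and its en_core_web_sm model are installed:

  import spacy

  nlp = spacy.load("en_core_web_sm")   # small pre-trained English pipeline
  doc = nlp("Apple bought a startup in Paris for $2 million on March 3rd.")

  for ent in doc.ents:
      print(ent.text, ent.label_)
  # typically something like: Apple ORG / Paris GPE / $2 million MONEY / March 3rd DATE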


Stemming, lemmatization

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Stemming

Lemmatization

Algorithms and software

  • Dawson [1]
    • From 1974. Similar to Lovins; intended as a refinement.
  • PyStemmer[2]
    • Mostly a Python wrapper for Snowball's library.
    • Works on English, German, Norwegian, Italian, Dutch, Portuguese, French, Swedish. (verify)
  • Snowball[3]
    • mostly does suffix stripping.
    • Works on English, French, German, Dutch, Spanish, Portuguese, Italian, Swedish, Norwegian, Danish. (verify)
    • Under BSD license[4].
  • Krovetz [6]
    • From 1993(verify)
    • Accurate inflectional stemmer, but complicated and not very powerful.
    • R. Krovetz (1993) Viewing morphology as an inference process
  • Lovins
    • From 1968
    • See also [7]. [8]
    • See also JB Lovins (1968) Development of a stemming algorithm
  • Lancaster (Paice/Husk) [9]
    • a.k.a. Lancaster (Paice/Husk) stemmer
    • From 1990
    • Fairly fast, and relatively simple. Aggressively removes endings (sometimes more than you would want).
    • See also [10]
  • RSLP stemmer (Removedor de Sufixos da Lingua Portuguesa (RSLP))[12] -
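To illustrate the difference between stemming and lemmatization: NLTK contains an implementation of the Snowball stemmers listed above, and a WordNet-based lemmatizer (the latter needs the WordNet data downloaded first):

  from nltk.stem import SnowballStemmer, WordNetLemmatizer

  # Stemming: suffix stripping, no dictionary; may produce non-words.
  stem = SnowballStemmer("english").stem
  print([stem(w) for w in ["running", "studies", "corpora"]])
  # e.g. ['run', 'studi', 'corpora']

  # Lemmatization: dictionary-based; one-time nltk.download("wordnet") needed.
  lemmatize = WordNetLemmatizer().lemmatize
  print([lemmatize("running", "v"), lemmatize("studies", "n"), lemmatize("corpora", "n")])
  # e.g. ['run', 'study', 'corpus']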


See also (lemmatization)

  • Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting



Speech segmentation

The act of recognizing word boundaries in speech, and artificially imitating the same in phonetic data.

Can also refer to finding syllables and phonemes.

Corpus linguistics

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Any study of language through (usually large) samples. Now usually refers to computer-based analysis.



Annotation

Annotated data is much easier to learn from, while learning from non-annotated corpora avoids bias to annotation errors.

Manually annotated corpora are typically better (up to the point where human disagreement happens), but are costly in terms of time and/or money to create. Semi-automatic methods are often used but only help to some degree.

Automating accurate annotation is a useful goal, though it is arguably more application than a field in itself.


Lexical annotation

Probably the most common: adds a lexical category to each word, according to a particular tagset; this implies knowing the lemma (...or having such knowledge in the trained data that e.g. statistical parsers are based on).
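For example, spaCy's pipelines annotate each token with a coarse part of speech, a finer-grained tag (a Penn-Treebank-style tagset for English), and a lemma (again assuming spaCy and the en_core_web_sm model are installed):

  import spacy

  nlp = spacy.load("en_core_web_sm")
  doc = nlp("The dogs were barking loudly.")

  for token in doc:
      # word form, coarse POS, fine-grained tag, lemma
      print(token.text, token.pos_, token.tag_, token.lemma_)
  # e.g. 'dogs NOUN NNS dog', 'barking VERB VBG bark', ...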


Phonetic annotation

Records how words were pronounced, at least phonemically, sometimes with more detailed phonetic information and possibly prosody (intonation, stress).


Semantic annotation

Disambiguates word senses. Relatively rare as it is costly to do manually (as any annotation is) and hard to do automatically.


Pragmatic and Discourse/Conversation annotation

(Sociological) discourse/conversation analysis is often served by systems of informed transcription/annotation - extra annotation by someone who was present, or who is viewing video - since discourse sees a lot of incomplete, ungrammatical sentences that may depend on physical context, pauses, gestures and more. This also crosses into pragmatic annotation.


There is also analysis that tries to extract e.g. reference across sentences (to resolve cross-sentence ellipsis). Systems such as Discourse Representation Theory (DRT) are formal analysis/specification systems.


See also


Distributional similarity

The observation is that words that occur in the same contexts tend to have similar meanings (Harris, 1954) -- "A word is characterized by the company it keeps" (Firth, 1957).
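A minimal sketch of the idea: represent each word by counts of the words it co-occurs with, and compare those count vectors, e.g. with cosine similarity. (Real systems use much larger corpora, association weighting such as PMI, dimensionality reduction, or learned embeddings.)

  import math
  from collections import Counter, defaultdict

  corpus = [
      "the cat sat on the mat",
      "the dog sat on the rug",
      "the cat chased the dog",
  ]
  stopwords = {"the", "on"}   # crude filtering so function words don't dominate

  # Context counts: which (non-stopword) words co-occur in the same sentence.
  contexts = defaultdict(Counter)
  for sentence in corpus:
      words = [w for w in sentence.split() if w not in stopwords]
      for w in words:
          for c in words:
              if c != w:
                  contexts[w][c] += 1

  def cosine(a, b):
      dot = sum(a[x] * b[x] for x in set(a) & set(b))
      norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
      return dot / norm if norm else 0.0

  print(cosine(contexts["cat"], contexts["dog"]))   # higher: cat and dog share contexts
  print(cosine(contexts["cat"], contexts["rug"]))   # lower: little shared context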


Collocation

A collocation is a set of words that occur together more often than basic probability would suggest (often judged from word-level statistics).


Collocation analysis studies the adjacency of words. This often easily turns up things like

  • preferences between words that (lexically speaking) might be interchangeable, some much stronger than others (consider the alternatives not chosen: not necessarily wrong, but certainly marked)
    • verb-preposition pairs, e.g. agree with
    • adjective-noun, e.g. maiden voyage, excruciating pain, spare time,
    • verb-noun, e.g. commit suicide, make a bed
    • verb-adverb, e.g. prepare feverishly, wave frantically
    • adverb-adjective, e.g. downright amazed, fully aware
    • noun combinations: a surge of anger, a bar of soap, a ceasefire agreement
    • ...and more


See e.g. http://www.ozdic.com/ for examples
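One common way to make 'more often than basic probability would suggest' concrete is pointwise mutual information (PMI) over adjacent word pairs - comparing the observed bigram probability with what independent word probabilities would predict. A small sketch:

  import math
  from collections import Counter

  words = "he made a bed and then made a cup of strong tea and went to bed".split()
  unigrams = Counter(words)
  bigrams = Counter(zip(words, words[1:]))
  n = len(words)

  def pmi(w1, w2):
      # log2( P(w1,w2) / (P(w1) * P(w2)) ): positive means the pair co-occurs
      # more often than independent word frequencies would predict.
      p_pair = bigrams[(w1, w2)] / (n - 1)
      p_indep = (unigrams[w1] / n) * (unigrams[w2] / n)
      return math.log2(p_pair / p_indep) if p_pair else float("-inf")

  for pair in [("made", "a"), ("strong", "tea"), ("a", "of")]:
      print(pair, round(pmi(*pair), 2))
  # ('strong', 'tea') scores highest; ('a', 'of') never occurs adjacently here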

Moving towards meaning

Often shallow (as in mechanical rather than understanding) but good enough for many uses.

Entity Linking

Entity Linking means assigning a unique identity to detected entities.

Wikipedia's example of "Paris is the capital of France" points out that NER can be fairly sure that Paris is an entity at all, but only EL commits to trying to figure out that Paris is probably a city - and a specific one among the several called Paris - and not e.g. a person.

This may then go on to try to extract a more formal ontology, and as such can also be useful in assisting annotation.
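A toy sketch of the disambiguation step - a hypothetical, hand-made candidate list and a word-overlap score; real entity linkers use large knowledge bases (e.g. Wikidata) plus learned context models:

  # Hypothetical mini knowledge base: candidate identities for the surface form "Paris".
  CANDIDATES = {
      "Paris": [
          {"id": "paris_france", "desc": "capital city of France on the Seine"},
          {"id": "paris_texas",  "desc": "small city in Texas United States"},
          {"id": "paris_hilton", "desc": "American media personality and businesswoman"},
      ],
  }

  def link(mention, sentence):
      # Score each candidate by word overlap between its description and the context.
      context = set(sentence.lower().split())
      def overlap(cand):
          return len(context & set(cand["desc"].lower().split()))
      return max(CANDIDATES.get(mention, []), key=overlap, default=None)

  print(link("Paris", "Paris is the capital of France"))
  # picks paris_france, because words like 'capital' and 'France' overlap with its description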


https://en.wikipedia.org/wiki/Entity_linking

Entailment

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Given two text fragments, (textual) entailment tries to answer whether (most people would agree that) the meaning of one can be inferred (is entailed) from the other.


Usually one of the two fragments is a short, simple hypothesis construction. To steal some examples:

Fragment: "Sir Ian Blair, the Metropolitan Police Commissioner, said, last night, that his officers were 'playing out of their socks', but admitted that they were 'racing against time' to track down the bombers."
Hypothesis: "Sir Ian Blair works for the Metropolitan Police."

Fragment: "Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year"
Hypothesis: "Yahoo acquired Overture" (based on the question "Who acquired Overture" and the fragment)

Hypotheses might well be based on the fragment itself (e.g. in information extraction), or might be matched against a fragment (e.g. in question answering). Many of the papers on entailment argue about what does and does not constitute entailment.


Matching is often relatively basic, with little consideration of modifiers, quantifiers, conjunctions, complex syntactic structures, complex semantic structures, or pragmatics.
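As a crude illustration of such basic matching (not a serious entailment system): score how much of the hypothesis is lexically covered by the text, which also shows exactly where this breaks down:

  def overlap_score(text, hypothesis):
      # Fraction of hypothesis words that also appear in the text.
      # Ignores negation, quantifiers and syntax - the limitations noted above.
      t = set(text.lower().split())
      h = set(hypothesis.lower().split())
      return len(h & t) / len(h)

  text = "Yahoo took over search company Overture Services Inc last year"
  print(overlap_score(text, "Yahoo acquired Overture"))   # ~0.67 - misses that 'took over' means 'acquired'
  print(overlap_score(text, "Overture acquired Yahoo"))   # same score, though the meaning is reversed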


Considered a fairly generic NLP task, potentially useful to various others.

Examples of entailment:



See also:


Coreference resolution


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


A referent is a person or thing you can refer to by name.


Words are co-referential if they refer to the same thing - the same referent. For example, in 'John had his dog with him', 'John' and 'him' are co-referential.

This is common in most natural language: more than one clause/sentence is talking about the same thing, and repeating the full word feels quite redundant, potentially to the point of semantic satiation (unless used for rhetorical anaphora - repetition for emphasis - which confusingly is a different concept from linguistic anaphora, the one closely related to reference).


When parsing meaning, it is often useful to find (co)referents, particularly since pronouns (and other pro-form constructions) are otherwise semantically empty.


It might have been called reference resolution, but seems to be called coreference resolution because typically both words are present in the sentence.

(If not, e.g. It in It's raining or a vague unspecified they in "they say", there is usually little to resolve).


Coreference resolution is often eased by words being marked for grammatical features like number, gender, case, tense, and person, and by tendencies such as that we tend to refer only to relatively recently mentioned concepts.
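A toy sketch of how those cues combine: resolve a pronoun to the most recently mentioned candidate that agrees in number and gender. The mention list and features are hand-supplied and hypothetical here; real resolvers extract them, and use far richer features and learned models:

  # Hand-annotated mentions in text order: (surface form, gender, number).
  mentions = [
      ("John",     "masc", "sg"),
      ("the dogs", None,   "pl"),
      ("Mary",     "fem",  "sg"),
  ]

  PRONOUNS = {
      "he": ("masc", "sg"), "him": ("masc", "sg"),
      "she": ("fem", "sg"), "her": ("fem", "sg"),
      "they": (None, "pl"), "them": (None, "pl"),
  }

  def resolve(pronoun):
      gender, number = PRONOUNS[pronoun]
      # Walk backwards: prefer the most recently mentioned agreeing candidate.
      for form, g, n in reversed(mentions):
          if n == number and (gender is None or g is None or g == gender):
              return form
      return None

  print(resolve("him"), resolve("them"), resolve("her"))
  # John the dogs Mary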


http://nlpprogress.com/english/coreference_resolution.html

https://nlp.stanford.edu/projects/coref.shtml

https://en.wikipedia.org/wiki/Coreference

Text summarization

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Text summarization usually refers to the (automated) process of taking a text and producing anything from a subject to a naturally formatted summary.


Basic implementations do only sentence extraction: they shorten the text by choosing which sentences are more interesting to keep, which may be as simple as some keyword-based weighting.
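A minimal sketch of that kind of frequency-based sentence extraction (with a naive sentence split; see the segmentation notes above for why that part alone deserves more care):

  import re
  from collections import Counter

  def summarize(text, n_sentences=2):
      # Naive sentence split on sentence-final punctuation followed by whitespace.
      sentences = re.split(r"(?<=[.!?])\s+", text.strip())

      # Word frequencies over the whole text, ignoring very short words.
      words = re.findall(r"[a-z']+", text.lower())
      freq = Counter(w for w in words if len(w) > 3)

      # Score each sentence by the summed frequency of its words,
      # keep the top n, and return them in their original order.
      def score(s):
          return sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))

      top = sorted(sentences, key=score, reverse=True)[:n_sentences]
      return " ".join(sorted(top, key=sentences.index))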


More complex implementations may do deeper parsing, and/or generate new text based on some semantically representative intermediate.

Natural language generation

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Natural language generation (NLG) can be taken as a subfield of Natural Language Processing, but is primarily concerned with generating linguistic expressions, often from knowledge bases and logic.


http://en.wikipedia.org/wiki/Natural_language_generation



Question answering

Machine translation

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Machine translation mostly concerns itself with transfer of lexical and semantic information between sentences in different languages.

Semantic and pragmatic transfer would be ideal, but most current models are not advanced enough to deal with those (though they may of course do so in accidental easy/best cases).


Text alignment

Statistical Machine Translation