Computational linguistics
Computational linguistics refers to the use of computers for language and text analysis.
It includes tasks like NLP, NLU, and more.
The more mechanical side
NLP has grown very broad.
With a lot of statistical and machine learning techniques around,
there is a distinction you can make between
- more mechanical approaches, which might end up with complete annotations
- more statistical tasks, which estimate likely analyses and may never commit to a single complete annotation
Tokenization, segmentation, boundaries and splitting
Tokenization refers to dividing free text into tokens. Usually the smallest units you want to use for a particular analysis/computation.
For example, "This, that, and foo." might turn into the token sequence: This , that , and foo .
You could also tokenize into sentences, lexical units or such, morphemes, lexemes, though you start running into fuzziness of boundaries, overlaps, and such.
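As a rough illustration of the word-and-punctuation case above, a minimal regex-based sketch (real tokenizers handle contractions, numbers, URLs, and much more):

```python
import re

def tokenize(text):
    # Words stay together; each punctuation character becomes its own token.
    # A crude sketch -- good enough to reproduce the example above, no more.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("This, that, and foo."))
# ['This', ',', 'that', ',', 'and', 'foo', '.']
```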
Tokenization can be a useful organizational tool, to keep transforming tasks from interfering with each other. Consider that various analyses, tagging processes, transformations, groupings and such (POS tagging, stemming, inflection marking, phrase identification, etc.) all want to work on the same token stream, but changes from one task could interfere with every other task.
Tokenization may help by having tokens be some realization of object/attribute, but it can also be in the way if under-modelled for the tasks at hand, and it is also quite easy to create token-based frameworks that are restrictive, prohibitively bothersome, or both. It is generally suggested that data structures be kept as simple as possible, though there is some question as to whether that means for a given task or in general, which is something that language processing frameworks in particular have to deal with.
Words
Taking words out of text is perhaps the most basic thing you may want to do.
In many language scripts, word boundaries are marked, many by delimiting with a space and some with things like word-final markers.
There are also scripts whose orthography does not include such marking. For example, reading Chinese involves a mental process of splitting out concept boundaries, which is sometimes fairly ambiguous without the semantics of the context. Japanese does not have spaces either (though it can assist reading by alternating kanji and kana at concept boundaries, where possible).
Sentences
Taking unstructured text or speech, and demarcating where individual sentences are.
Often supports other tasks, e.g. for analysis to focus on structure within sentences. Sometimes as a more direct part of the point, as in text summarization.
Depending on a language's orthography and phonological structure, this task is not at all trivial when you want good accuracy on real-world text, even just on well-structured text. Utterances are much harder. In many languages, abbreviations also make things much harder.
Many initial and still-current implementations focus on mechanical, token-based rules, while in reality there are a lot of exception cases that can only be resolved based on content and context. Some can be handled with specific exceptions, but in the long term it is less messy and more accurate to use statistical, collocative analysis.
Notes:
- sometimes, terms like 'sentence boundary detection' refer more specifically to adding sentence punctuation to punctuation-less text such as telegrams (somewhat comparable to word boundary detection in text and sound, and line break detection in text).
- Sentence extraction can refer to a specific approach, a shallow statistical approach that takes sentences and tries to determine saliency. It does not involve knowledge of meaning or structure. It is often seen in (and as) text summarization.
In writing systems that use capitals and punctuation, the simplest approach to text sentence splitting is splitting on periods, question marks and exclamation marks, which may get expanded with conditions such as 'followed by a space and a capital letter', 'not preceded by what looks like a common title or honorific', and such. This method will always be fragile to specific types of writing and certain types of text, so it will never really yield a widely robust method. However, a day or two of rule development, ordering, and further tweaking can get you very decent results for corpus analysis needs.
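A sketch of that dots-and-context approach (the function and title list here are made up for illustration; the title list is, as noted below, never complete):

```python
import re

# A deliberately naive dots-and-context splitter: break after . ! ? when followed
# by whitespace and a capital (or opening quote/paren), unless the preceding word
# looks like a title/honorific.
_TITLES = {"Dr", "Mr", "Mrs", "Ms", "Prof", "St"}

def naive_sentence_split(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s+[\"'(]?[A-Z])", text):
        before = text[:m.start()]
        prev_word = before.rsplit(None, 1)[-1] if before.strip() else ""
        if prev_word in _TITLES:
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(naive_sentence_split(
    "It was due Friday at 5 p.m. Saturday would be too late! Ask Dr. Smith."))
# splits after 'p.m.' (the ambiguity discussed below) and after 'late!', but not after 'Dr.'
```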
It has been suggested that more robustness can come from adding functional sentence analysis, such as looking for the presence/absence of (main) verbs, and other simple models of what we ourselves do when listening to sentences. Humans too will usually keep recent sentence parts in mind and go back to verify. For example, in "It was due Friday at 5 p.m. Saturday afternoon would be too late.", it is probably only at would that many people realize that a new sentence probably started at Saturday. Perhaps a good heuristic for whether a split is sensible is looking at the completeness of each candidate sentence, considering verbs and perhaps other things (which doesn't necessarily need parsing or tagging, at least in a simple version).
Note that there is an upper limit on accuracy anyway - there are enough cases where people will not come to an agreement exactly where the splits really are.
Potential complications, particularly to the dots-and-context approach:
- a period not used as a full stop, such as various types of:
- titles/honorifics: Dr., Mrs. Dres.
- acronyms/initialisms with periods, for example There are no F.C.C. issues., are often not sentence boundaries -- but they can be, when they sit at the end of a sentence.
- similarly marked abbreviations (possibly incorrectly), such as feat., vs., and such.
- numbers, e.g. The 3.048 meter pole and perhaps You have $10.33., and even The 2. meter pole.
- quotes:
- He asked me "where are you headed?" What? There. "What? There!" "What?" "There!" "Wheeee" said I. (with possible complications in the form of spaces, and other minor variations)
- apostrophes can easily mess up single-quote balancing rules. Consider e.g. 'tis not for thee.
- fancy quotes, mixed quote use
- stylistic text (use/abuse)
- list items often have no period, but are usually (considered to be) sentences
- headings that continue in the text
- You're probably designing for English by now (a trainable model may be more practical)
- unusual syntax, such as bad internet english, l33t, computer code, and others
Interesting things to think about - and possible rules:
- quoted text (and parenthetical text) can be considered sub-sentences; you may want to model that.
- sentences cannot start with certain punctuation (period, comma, percent sign, dashes, etc.)
- Sentence separators (/ sub-sentence) usually happen at/near characters like " - --
- A period followed by a digit is often not a sentence boundary
- A complete set of titles/honorifics is basically never complete - and a pattern like [A-Z][a-zA-Z]*[.] won't always do either.
- .!? split (sub)sentences, and may generally be followed by " or )
- 'the next character is a capital' is decent verification for a break at the current position -- but not always; names, proper nouns and nonsense triggers for the verification can mess that up.
- sentences usually don't end with titles/honorifics (but can)
- sentences rarely end with initials (but can).
- sentences sometimes end with abbreviations.
- balanced parentheses and balanced double quotes are a matter of style, and may differ (see e.g. conventions in book dialogues) -- you probably want to do basic text analysis to see whether this is a useful constraint or not.
Other methods, that try to be less content-naive, include:
- looking for patterns (in words and/or their part of speech) to judge whether any particular sentence split is likely correct
- using sets of rules, heuristics, or probabilities (often token-based)
- interaction with named entity recognition / name recognition
- Information from overall text, e.g. positions of paragraphs, indented lines, and such
- interaction with POS tagging
- ...and combinations.
Note that models that consider only direct context (e.g. simple regexps, token n-grams for small n, certain trained models) easily have an upper limit on accuracy, as the same direct context does not always mean identical sentence split validity.
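For comparison, the Punkt tokenizer linked below is a mostly unsupervised, trained splitter; a minimal usage sketch with NLTK (assuming NLTK and its punkt data are installed; newer NLTK versions may name the resource punkt_tab):

```python
import nltk
nltk.download("punkt", quiet=True)   # trained Punkt sentence tokenizer models

text = ('It was due Friday at 5 p.m. Saturday afternoon would be too late. '
        'Dr. Smith disagreed. "What?" "There!"')
for sentence in nltk.sent_tokenize(text):
    print(repr(sentence))
```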
Unsorted related links:
- http://www.ibm.com/developerworks/java/library/j-boundaries/boundaries.html
- https://apps.lis.uiuc.edu/wiki/display/MONK/Sentence+Splitting+in+MorphAdorner+(archive)
- http://nltk.org/doc/guides/tokenize.html#punkt-tokenizer
See also
- Grefenstette, Tapanainen, "What is a Word, What is a Sentence? Problems of Tokenization"
- Computational aspects of phrases and clauses
Clause segmentation, clause extraction
Word boundary detection
Word boundary detection is useful when the writing system does not mark word boundaries (as discussed above), and when working from speech or other unsegmented input.
http://icu-project.org/docs/papers/text_boundary_analysis_in_java/
Character boundary detection
Character boundary detection is relevant in Unicode, in that you can't split a string into multiple strings just anywhere. Consider letter-plus-combining-accent pairs, Unicode characters that are drawn differently when combined, Unicode surrogates, and such.
Unicode, somewhat more accurately, uses the term Grapheme Cluster Boundaries.
https://www.unicode.org/reports/tr29/tr29-9.html
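A quick Python illustration of why code points are not what you want to split on (the commented-out lines assume the third-party regex package is installed):

```python
# A combining accent is a separate code point; naive slicing can detach it from its base.
s = "cafe\u0301"            # 'café' written as 'e' + U+0301 COMBINING ACUTE ACCENT
print(len(s))               # 5 code points, though readers see 4 characters
print(s[:4])                # 'cafe' -- the accent has been cut off its base letter
print(s[4])                 # the lone combining accent

# The third-party 'regex' module can iterate grapheme clusters via \X:
# import regex
# print(regex.findall(r"\X", s))   # ['c', 'a', 'f', 'é']
```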
Chunking
Chunking often refers to phrase chunking, which identifies and marks phrases, usually focusing on noun phrases, verb phrases, and sometimes prepositional phrases.
Where a fuller parser is understood as giving a more complete derivation of most of the structure in a sentence (which is hard to do with good precision), a chunker can be seen as a "do one task (preferably well)" tool.
For example, a chunker may do nothing more than detecting noun groups, e.g. detecting phrases with a good guess as to what their head is, possibly also trying verb groups, verb phrases. Maybe partitioning a sentence into NP, NP, prepositions, other.
And nothing more than that.
Phrase identification is in many ways the same problem, but can refer to a more loosely described search for compositional word groups.
Note that there is an upper limit on this. Consider for example that the use of multiple modifiers on the same noun is only so compositional, humans may disagree on details, and converting to a semantic representation may depend on real-world knowledge.
Approaches include:
- marking of known phrases
- looking for word/POS patterns (statistics)
- rough estimation by breaking on various closed class words and punctuation
Their output is often also simple - nothing recursive, nothing overlapping.
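A minimal word/POS-pattern chunker of the kind described above, sketched with NLTK's RegexpParser (assuming NLTK and its tokenizer/tagger data are installed):

```python
import nltk
# requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumped over the lazy dog"))

# One pattern, one job: determiner + adjectives + noun(s) becomes an NP chunk.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
print(chunker.parse(tagged))   # a shallow tree: NP chunks, everything else left flat
```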
Now that do-it-all parsers are more common, chunking for the "do a select job decently" is less common,
but doing a specific job based on a fuller parse (e.g. detecting noun phrases) may be more common,
and maybe more precise than classical chunkers because it gets to use more contextual information that the parser found.
You can argue whether that is still called chunking.
Chunking is sometimes described as shallow parsing: parsing that provides a partial syntactic structure of a sentence, with limited tree depth, as opposed to full parsing.
Stemming, lemmatization
Stemming
Lemmatization
Algorithms and software
- Dawson [1]
- From 1974. Similar to Lovins; intended as a refinement.
- PyStemmer[2]
- Mostly a Python wrapper for Snowball's library.
- Works on English, German, Norwegian, Italian, Dutch, Portuguese, French, Swedish. (verify)
- Snowball[3]
- Porter [5]
- From 1980(verify)
- Fairly simple and fast. Likely the most widely used method.
- See also http://tartarus.org/~martin/PorterStemmer/
- See also Porter (1980) An algorithm for suffix stripping
- Krovetz [6]
- From 1993(verify)
- Accurate inflectional stemmer, but complicated and not very powerful.
- R. Krovetz (1993) Viewing morphology as an inference process
- Lancaster (Paice/Husk) [9]
- a.k.a. Lancaster (Paice/Husk) stemmer
- From 1990
- Fairly fast, and relatively simple. Aggressively removes endings (sometimes more than you would want).
- See also [10]
- Oleander [11]
- RSLP stemmer (Removedor de Sufixos da Lingua Portuguesa)[12]
- UEA [13]
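For a sense of the difference in behaviour, a quick sketch comparing the Porter stemmer listed above with a dictionary-based lemmatizer, as exposed by NLTK (assuming NLTK and its WordNet data are installed):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# requires: nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"), stemmer.stem("running"))
# stemming crudely strips suffixes; the result need not be a real word
print(lemmatizer.lemmatize("studies"), lemmatizer.lemmatize("running", pos="v"))
# lemmatization maps to dictionary forms, but needs to be told (or guess) the part of speech
```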
See also (lemmatization)
- Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting
- stemming bibliographies such as http://www.comp.lancs.ac.uk/computing/research/stemming/general/bibliography.htm
Speech segmentation
The act of recognizing word boundaries in speech, and the artificial imitation of doing the same in phonetic data.
Can also refer to finding syllables and phonemes.
Named entities
Named entity tasks usually refer to finding/recognizing phrases that are (used as) nominal compounds.
Systems often deal with entities such as persons, organizations, locations, named objects, and such, particularly when they can work from known lists.
The same systems often also extract simple references such as times and dates, quantities such as monetary values and percentages.
Specific tasks in the area may be referred to / known as:
- Entity Extraction
- Entity Identification (EI)
- Named Entity Extraction (NEE)
- Named Entity Recognition (NER)
- Named Entity Classification (NEC)
- ...and others.
See also
Named entity recognition
Named-Entity Recognition (NER) considers Named Entities to be any concept for which we use fairly consistent names.
This could be considered two tasks in one: recognizing such a name, and classifying it as a particular type of entity.
And typically, the aim is specifically to avoid a precompiled list (a 'gazetteer'),
and instead to generalize a little, so we can find things that fit a pattern
even when we don't know the specific case.
There is no singular definition of the kinds of things this includes, but it often focuses on
- names of people,
- names of organizations,
- names of locations,
and also
- medical codes and other consistently encoded things,
- quantities
- dates
- times
- money amounts
...probably in part because those are pretty predictable, though note that none of those last few are even names; they're just useful things to tag.
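As a concrete illustration (not part of the definitions above), pretrained pipelines such as spaCy's expose recognized entities with type labels; a minimal sketch, assuming spaCy and its small English model en_core_web_sm are installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # small pretrained English pipeline
doc = nlp("Sir Ian Blair of the Metropolitan Police spent $2 million in Paris last March.")
for ent in doc.ents:
    print(ent.text, ent.label_)      # entity span plus a type label such as PERSON, ORG, MONEY, GPE, DATE
```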
Corpus linguistics
Any study of language through (usually large) samples. Now usually refers to computer-based analysis.
Annotation
Annotated data is much easier to learn from, while learning from non-annotated corpora avoids bias to annotation errors.
Manually annotated corpora are typically better (up to the point where human disagreement happens), but are costly in terms of time and/or money to create. Semi-automatic methods are often used but only help to some degree.
Automating accurate annotation is a useful thing, though it is arguably more an application than a field in itself.
Lexical annotation
Probably the most common: adds lexical category to words, according to a particular tagset, which implies knowing the lemma (...or having such knowledge in the trained data that e.g. statistical parsers are based on).
Phonetic annotation
Records how words were pronounced, at least phonemically, sometimes with more detailed phonetic information and possibly prosody (intonation, stress).
Semantic annotation
Disambiguates word senses. Relatively rare as it is costly to do manually (as any annotation is) and hard to do automatically.
Pragmatic and Discourse/Conversation annotation
(Sociological) discourse/conversation analysis is often served by informed transcription/annotation, including extra annotation by someone present or viewing video, since discourse sees a lot of incomplete, ungrammatical sentences that may depend on physical context, pauses, gestures and more - which also crosses into pragmatic annotation.
There is also analysis that tries to extract e.g. reference across sentences (to resolve cross-sentence ellipsis). Systems such as Discourse Representation Theory (DRT) are formal analysis/specification systems.
See also
- Wikipedia:Corpus linguistics
- http://www.athel.com/corpus.html - fairly large index of related sites
- http://www.natcorp.ox.ac.uk/corpora.html
- http://ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm
- http://www.comp.lancs.ac.uk/ucrel/annotation.html
- http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/corpus4/4fra1.htm
Distributional similarity
The observation is that words that occur in the same contexts tend to have similar meanings (Harris, 1954) -- "A word is characterized by the company it keeps" (Firth, 1957).
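A toy sketch of that idea: characterize each word by a small window of co-occurring words and compare those context vectors (tiny corpus, purely illustrative):

```python
from collections import Counter, defaultdict
from math import sqrt

corpus = "the cat sat on the mat . the dog sat on the rug . the cat chased the dog".split()

# Count co-occurrence within a +/-2 word window: each word is characterized by its company.
window = 2
vectors = defaultdict(Counter)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            vectors[w][corpus[j]] += 1

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    return dot / (sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values())))

print(cosine(vectors["cat"], vectors["dog"]))   # similar contexts -> relatively high
print(cosine(vectors["cat"], vectors["on"]))    # different role -> lower
```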
Collocation
A collocation refers to words that occur together more often than basic probability would suggest (often based on word-level statistics).
Collocation analysis studies the adjacency of words. This often easily turns up things like
- preferences between words that (lexically speaking) might be interchangeable, some of which much stronger than others (consider the alternatives not chosen. Not necessarily wrong, but certainly marked)
- verb-preposition pairs, e.g. agree with
- adjective-noun, e.g. maiden voyage, excruciating pain, spare time,
- verbs-noun, e.g. commit suicide, make a bed
- verb-adverb, e.g. prepare feverishly, wave frantically
- adverb-adjective, e.g. downright amazed, fully aware
- noun combinations: a surge of anger, a bar of soap, a ceasefire agreement
- ...and more
- MWEs of some kind
See e.g. http://www.ozdic.com/ for examples
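NLTK ships a collocation finder that scores adjacent word pairs by association measures such as PMI; a minimal sketch, assuming NLTK and its genesis corpus data are installed:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# requires: nltk.download("genesis")  -- any sizable tokenized corpus will do

words = nltk.corpus.genesis.words("english-web.txt")
measures = BigramAssocMeasures()

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)                   # ignore very rare pairs
print(finder.nbest(measures.pmi, 10))         # pairs that co-occur far more than chance
```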
Unsorted tasks (mostly other than parsing)
Word-level
Hyphenation, syllabization
Syllabization is about spoken sounds - syllables.
Hyphenation is about typesetting - graphemes. It follows readability -- which usually means it follows sounds and/or morphemes. It is also not very strictly defined.
Syllabization
Syllabification (syllabication, sometimes syllabization) breaks a word into groups of constituent sounds.
This isn't always a hard, clear-cut grouping. Ambisyllabicity refers to a sound being the coda of one and the onset of the next syllable. For example, the t in bottle is the onset of the second, but also the coda of the first syllable (it helps that this is a stop).
There are various patterns/observations (some are basically rules), including:
- every syllable needs a vowel sound
- the number of syllables equals the number of vowel sounds (always?)
- digraphs are not divided
- consonant blends are not divided
- compound words will split their parts
- presence of x or ck: usually divided after
- adjacent consonants: often split there
- vcv: often split after the c
- http://english.glendale.cc.ca.us/syllables.html
- http://www.createdbyteachers.com/syllablerulescharts.html
- http://teacher.scholastic.com/reading/bestpractices/phonics/syllabication.pdf
Hyphenation
Hyphenation (usually) refers to inserting a hyphen (with variation in typography) in a word, for one of many reasons, including...
Typographic hyphenation
One common use appears in typesetting, for layout reasons: hyphens often break long words between two lines, to avoid having the line be wider than the rest of the paragraph or, in full justification, to avoid having to insert very wide spaces between all words.
Since this is primarily about readability (which, yes, is guided by morphemes and pronunciation), it turns out to be mostly about what to avoid -- more about acceptability than (singular) correctness.
That leaves plenty of room for you to be lost in purposeless pedantry trying to create singular correctness.
There are many possible rules, plenty of exceptions to those, and a good number of words where sources disagree.
Such hyphenation is most important in good-looking fully justified text, and less important in right-ragged text. For the latter, no hyphenation is better than bad hyphenation, though particularly long words cannot be escaped, so some minimal rules are a good idea.
A decent read: http://www.melbpc.org.au/pcupdate/9100/9112article4.htm
Dialect-specific
The decision of where to break a word in typesetting can vary. For example for English, US hyphenation is based on pronunciation so will generally follow syllable edges, while UK hyphenation plays by etymology/morphemes first, sound second. For example: the hyphenation of mechanism is mechan-ism in the UK, and mecha-nism in the US. That of progress is pro-gress in the UK, prog-ress in the US.
For both US and UK English, style guides will add some rules, for example avoiding breaks after just the first character (e.g. o-riginally), and avoiding cases where you may feed misleading expectations of what the rest of the word is, for example in coin-cidence or co-inage.
American English:
- based on sound, both in that it follows stress and tends to follow syllables
British English:
- based on morphology, then sound
Australian English:
Computational hyphenation
Computational hyphenation has existed for quite a while, for example in anything that needs to do text layouting.
Hyphenation can also be helpful to some computational linguistic tasks, such as recognizing similar words (...slightly more strictly than with no word analysis at all...), partly because hyphenation points will be pretty consistent e.g. between different inflections and derivations of the same word.
Hyphenating well is a complex task, because it may involve pronunciation, etymology, and human readability, and the combination of those is (in many languages) not easy to succinctly convert to a few rules.
Such rules may in fact give conflicting suggestions.
It is not as hard to get decent results, though.
For example, LaTeX typesetting guesses possible break points by character context, and puts different weights on each. It uses this very simple model to choose the most likely break point, and prefer easy breaks over more questionable breaks (in words and between words, as this is purely about typesetting).
Such hyphenation is rarely actually correct, in that it doesn't necessarily split on syllables, morphemes, or necessarily do clever things with unusual words such as loan words, nor will all possible break points necessarily turn up, but considering its simplicity, this system works quite well.
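Implementations of this pattern approach (Knuth/Liang, see below) exist outside TeX as well; for instance, a quick sketch with the pyphen Python package (an assumption that it and its en_US/en_GB dictionaries are installed):

```python
import pyphen   # Liang-style hyphenation patterns, as also used by TeX and LibreOffice

us = pyphen.Pyphen(lang="en_US")
uk = pyphen.Pyphen(lang="en_GB")
for word in ["mechanism", "progress", "hyphenation"]:
    print(word, us.inserted(word), uk.inserted(word))   # every break point the patterns allow
```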
In this context, a soft hyphen carries two meanings:
- an invisible character that is a manual hint for automatic hyphenation (a sense largely used in word processors, and also web browsers)
- a character automatically inserted as part of line breaking. That is, the visual result inserted by the hyphenation process.
See also:
- http://www.anastigmatix.net/postscript/Hyphenate.html (TeX's algorithm - Knuth, Liang)
Readability, convention
- within words
Certain constructions, often morphological (e.g. affixes, agglutination), can introduce letter sequences that are harder to read.
In English, hyphens are commonly used to help readability. Cases include
- places where double vowels would be introduced, such as in anti-inflammatory (rather than antiinflammatory)
- cases where the original word was capitalized, such as anti-Semitic (rather than antiSemitic or antisemitic)
Rules like these exist in many languages, though the reasons and details vary - and may not be quite as strict as some people and books claim.
- lexicalized
In English, compounds (of specific types) can (often because of delexicalization) variably appear with a space, a hyphen, or agglutinated (written together). A number of compounds are hyphenated because of their current state in that process.
There are quite a few constructions that are usually hyphenated (in an almost lexical(ized) way), such as expressions that act as compounds, and some uses of morphemes, such as 'self-' as a prefix.
Some of these things happen more in UK English than in US English.
For example, it is not unusual to see "good-bye" in UK English, where US English would write it as "goodbye". In this case, both English variants allow both forms, but have a preference. In other cases the preference may be stronger, and the other form is considered to look weird or incorrect.
The patterns that control these things are influenced by a language's tendency to agglutinate, and various other things.
- Compounds
Hyphens can mark cases where multiple words act as a phrase, but could be easily misunderstood or misread.
There are few wide rules and tendencies, a bunch of specific-case rules, and quite a bit of real-world variation.
Notes
- UK English does it more than US English
- rarely done after -ly adverbs or comparatives
- common when there is ambiguity in compound modifier / modified compounds
- Ambiguity depends on case and context, and hyphenation is optional when one meaning is a lot more obvious, and the alternative meaning would either lead to other formulations or not occur to us at all. For example, "High school students", while technically ambiguous, is rarely understood as "school students high on drugs", so the hyphen in high-school is optional.
- noun-verb combinations get this treatment (more than noun-noun)
- seems much more likely when a modifier appears before the noun, much more rarely if it appears after
- considered when a phrase leads to a garden path quality within a sentence, e.g. some uses of ill-advised, level-headed
- older compounds, common compounds, idioms, and anything else collocative may be lexicalized enough to be less ambiguous today, so see less hyphen use. You would probably read "ill advised patient" as "ill-advised patient", which you wouldn't if 'ill-advised' wasn't a thing.
Lexical hyphens
Heteronyms in different lexical categories may have different hyphenation (and syllabization).
Consider 'invalid' as a noun (in-va-lid) and as an adjective (in-val-id), largely due to lexical stress.
See also (hyphenation)
Language detection
Identifying the language of a given piece of text can be handy for various language-specific computational tasks, including text classification, sentence splitting, search fuzzing, and more.
In some ways it is just another classification problem.
Your method may need to focus on one or more of, and might choose to use all of:
- languages
- writing systems (also e.g. romanizations and such), arguably a real-world distinction you can layer on top of the mechanical parts accurately enough
- text encodings (you can normalize this by detecting text coding and using only unicode)
- Note that character set may be a strong indication of the stored language
...and this should probably be a choice, not side effect of your method.
A proper system probably can use all such information, but is not distracted by it. You could e.g. detect text encoding (see e.g. http://chardet.feedparser.org/) and use it to give a mild boost to the languages that encoding is usually used for, and have your actual detection mechanism work on unicode characters to base it on actual human writing and not computer coding.
Lexical methods (recognizing words)
A very simple (e.g. bayesian-style) model where word matches are basically votes for respective languages.
Surprisingly accurate if your to-be-identified text is more than a few words, and you have relatively complete vocabularies for each of your target languages.
It may be useful to make this more involved, e.g. do word n-grams to model probabilities of words likely to follow others, though there is some chicken-and-egg issue in the lack of normalization.
The larger the text, the likelier that a bunch of words will match even limited lexical resources (though this can vary with how synthetic the language is), although it can be fragile to wordlist length and means of matching.
For robustness on smaller texts you want a large enough vocabulary for each language, and what constitutes 'enough' can vary with the language, for example with how much affixing and synthesis happens in that language.
Note that you may still want to use probabilities and/or have a lower limit on the amount of indicators required, because things like adopted words and jargon and such may throw off the guess to a historically related or an auxiliary language.
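A toy sketch of such word-vote detection (tiny hand-made vocabularies, purely illustrative; a real system would use much larger wordlists and probabilities):

```python
# Toy word-vote language guesser: each known word votes for its language(s).
VOCAB = {
    "en": {"the", "and", "of", "to", "is", "in", "that"},
    "nl": {"de", "het", "en", "van", "is", "dat", "een"},
    "de": {"der", "die", "das", "und", "ist", "von", "ein"},
}

def guess_language(text, min_votes=2):
    words = text.lower().split()
    votes = {lang: sum(w in vocab for w in words) for lang, vocab in VOCAB.items()}
    best = max(votes, key=votes.get)
    return best if votes[best] >= min_votes else None   # lower limit on indicators

print(guess_language("De kat zat op de mat en keek naar het raam"))   # -> 'nl'
```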
Character n-grams
A mechanically simple-enough approach is a statistical n-gram setup.
There are a number of varying implementations of the n-gram idea, based on the observation that alphabet/character usage and language's morphology make for character sequences that point to particular languages.
It is also limited. Consider that related languages share a common origin, and are likely to share both roots and some morphology, particularly languages with a Latin or Romance background, and differences in spelling and morphology may not stand out very much from the things shared.
Even so, n-gram methods can work well enough when trained on just a few example documents per language and when asked to classify half a dozen words or more.
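A minimal sketch of the Cavnar & Trenkle style rank-order profile comparison (referenced below), trained here on single toy sentences rather than real documents:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    text = " " + " ".join(text.lower().split()) + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]            # rank-ordered profile

def out_of_place(profile, doc_profile, max_penalty=1000):
    # Cavnar & Trenkle style rank distance: sum of rank differences, with a
    # fixed penalty for n-grams the language profile has never seen.
    rank = {g: i for i, g in enumerate(profile)}
    return sum(abs(rank.get(g, max_penalty) - i) for i, g in enumerate(doc_profile))

# Profiles would normally be trained on a few documents per language.
profiles = {
    "en": ngram_profile("the cat sat on the mat and looked out of the window"),
    "nl": ngram_profile("de kat zat op de mat en keek uit het raam naar buiten"),
}

snippet = "the dog looked out of the window"
print(min(profiles, key=lambda lang: out_of_place(profiles[lang], ngram_profile(snippet))))
```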
See also (language detection)
- http://en.wikipedia.org/wiki/Language_identification
- http://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart
- Beesley (1988) Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text [14]
- WB Cavnar, JM Trenkle (1994) N-Gram-Based Text Categorization [15]
- A Poutsma (2001) Applying Monte Carlo Techniques to Language Identification [16]
- P Sibun, JC Reynar (1996) Language Identification: Examining the Issues [17]
- G Grefenstette (1995) Comparing Two Language Identification Schemes [18]
- Hughes et al. (2006) Reconsidering Language Identification for Written Language Resources [19]
- EM Gold (1967) Language identification in the limit [20]
- http://www.speech.inesc.pt/~dcaseiro/html/bibliografia.html (someone's bibliography, for both written and spoken language)
Spell checking
Sentiment analysis
In theory, we study affective states. In practice we mostly estimate where phrasing seems to sit on a scale between polar opposites like positive/negative (or sometimes _just_ those opposites) -- because we only have some words to go on.
This is usually specifically applied to things like customer reviews,
seeing how happy they are about it, and what we can tie that to.
There are other ideas, like trying to judge how subjective or objective statements are.
Because even a (fuzzy) classifier can help, implementations sometimes amount to little more than a soft classifier with two outcomes or, more commonly, a value on that scale.
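A lexicon-based scorer like VADER (shipped with NLTK) is a common example of exactly that kind of soft classifier; a minimal sketch, assuming NLTK and the vader_lexicon data are installed:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The battery life is great, but the screen is disappointing."))
# a dict with neg/neu/pos proportions and a 'compound' score in [-1, 1]
```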
Word sense disambiguation
Taking a word that may have different roles and/or meanings, and figuring out which sense is meant in this context.
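The classic baseline is the (simplified) Lesk algorithm, which picks the WordNet sense whose gloss overlaps most with the context; NLTK exposes it directly (assuming NLTK and its WordNet/punkt data are installed; accuracy is famously modest):

```python
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
# requires: nltk.download("wordnet"); nltk.download("punkt")

context = word_tokenize("I went to the bank to deposit my paycheck")
sense = lesk(context, "bank")
print(sense, "-", sense.definition() if sense else "no sense found")
```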
Relations more on the grammatical side
Constituency parsing
Dependency parsing
Relations more on the referential side
Pointers to the real world
NER
Entity Linking
Entity Linking means assigning a unique identity to detected entities.
Wikipedia's example of "Paris is the capital of France" points out that
- while NER can be fairly sure that Paris is an entity at all,
- only EL commits to trying to figure out that Paris is probably
- a city (and not e.g. a person)
- and probably a specific city (among multiple called Paris) (it probably returns some reference to that specific city)
This may then go on to try to extract more formal ontology, and as such can also be useful in assisting annotation.
https://en.wikipedia.org/wiki/Entity_linking
Pointers within the sentence
Semantic role labeling; Relation extraction
Coreference resolution
- A reference is the act of referencing,
- A referent is a person or thing you can refer to by name,
- Words are co-referential if multiple words refer to the same thing - the same referent.
- a coreference is an instance of one word referring to another
- For example, in 'John had his dog with him', 'John' and 'him' are co-referential, him is a coreference to John.
- Coreference resolution is the act of figuring out what such a word refers to.
Note that you can often tell something is a reference before knowing it is a coreference - in "Bill said he would come", he may refer to Bill or to someone else (which may or may not be determinable from real-world or textual context).
References and coreferences are common in most natural language,
as nearby clauses/sentences are often talking about the same thing,
and repeating the full word feels quite redundant (potentially to the point of semantic satiation) - unless used for rhetorical anaphora, repetition for emphasis, which confusingly is a different concept from linguistic anaphora, which is closely related to references.
When parsing meaning, it is often useful to find (co)referents, particularly since pronouns (and other pro-form constructions) are otherwise semantically empty.
It might have been called reference resolution, but seems to be called coreference resolution because typically both words are present in the sentence.
(If not, e.g. It in It's raining or a vague unspecified they in "they say", there is usually little to resolve).
Coreference resolution is often eased by words being marked for grammatical features like number, gender, case, tense, person, and such,
and by tendencies such as referring only to relatively recently mentioned concepts,
and so on.
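A toy sketch of just those two cues, recency and number agreement (all mentions and positions here are made up; real resolvers use far richer features and learned models):

```python
# Toy pronoun resolution: pick the most recent preceding mention that agrees in number.
MENTIONS = [  # (position, text, number)
    (0, "John", "sg"),
    (3, "his dog", "sg"),
    (7, "the neighbours", "pl"),
]

def resolve(pronoun_pos, pronoun_number):
    candidates = [m for m in MENTIONS if m[0] < pronoun_pos and m[2] == pronoun_number]
    return candidates[-1][1] if candidates else None   # most recent agreeing mention

print(resolve(pronoun_pos=9, pronoun_number="pl"))   # 'they' -> 'the neighbours'
print(resolve(pronoun_pos=9, pronoun_number="sg"))   # 'he'/'it' -> 'his dog' (may well be wrong)
```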
http://nlpprogress.com/english/coreference_resolution.html
https://nlp.stanford.edu/projects/coref.shtml
https://en.wikipedia.org/wiki/Coreference
Extracting predicates and answering questions
Clause extraction
While useful in other contexts, the focus on individual clauses may be most interesting when trying to figure out what set of predicates are implied by a sentence.
Entailment
Given two text fragments, (textual) entailment tries to answer whether (most people would agree that) the meaning of one can be inferred (is entailed) from the other.
Usually one of the two fragments is a short, simple hypothesis construction. To steal some examples:
Fragment | Hypothesis fragment
---|---
Sir Ian Blair, the Metropolitan Police Commissioner, said, last night, that his officers were 'playing out of their socks', but admitted that they were 'racing against time' to track down the bombers. | Sir Ian Blair works for the Metropolitan Police.
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year | Yahoo acquired Overture (based on the question "Who acquired Overture" and the fragment)
Hypotheses might well be based on the fragment itself (e.g. in information extraction), or might be matched with fragment (e.g. in question answering). Many of the papers on entailment argue for what does and what does not constitute entailment.
Matching is often relatively basic, with little consideration of modifiers, quantifiers, conjunctions, complex syntactical structures, complex semantical structures, or pragmatics.
Considered a fairly generic NLP task, potentially useful to various others. Examples of entailment in other tasks:
- document summarization: a summary should be entailed by the text it summarizes
- information extraction: extracted information should be entailed by the text it comes from
- question answering: an answer is entailed by a supporting snippet of text (that fits the question)
- information retrieval: might extract and index useful information (e.g. subject)
- paraphrasing should be fairly symmetrically entailed
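Pretrained NLI models are the usual way to get a quick entailment judgment these days; a hedged sketch using Hugging Face transformers with a publicly available MNLI model (an assumption that transformers, torch, and the roberta-large-mnli weights are available):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"   # a model fine-tuned for entailment/neutral/contradiction
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "Yahoo took over search company Overture Services Inc last year."
hypothesis = "Yahoo acquired Overture."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]
for i, p in enumerate(probs):
    print(model.config.id2label[i], round(float(p), 3))
```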
See also:
Text summarization
Text summarization usually refers to the (automated) process of taking a text and producing anything from a subject to a naturally formatted summary.
Basic implementations do only sentence extraction, simplifying the whole by choosing which sentences are most interesting to keep, which may be as simple as some keyword-based weighing.
More complex implementations may do deeper parsing, and/or generate new text based on some semantically representative intermediate.
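A minimal sketch of that basic, extractive flavour (word-frequency scoring only; no parsing, no generation; stopword list is deliberately tiny):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "for", "on"}

def summarize(text, n_sentences=2):
    # Crude sentence extraction: score sentences by the frequency of their content words.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS)
    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in ranked]     # keep original order

print(summarize("Tokenization splits text into tokens. Tokens feed later analysis. "
                "The weather was nice. Analysis of tokens supports tagging and parsing."))
```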
Machine translation
Machine translation mostly concerns itself with transfer of lexical and semantic information between sentences in different languages.
Semantic and pragmatic transfer is ideal, but most current models are not advanced enough to deal with those (though they may of course do so in accidental easy/best cases).
Natural language generation
Natural language generation (NLG) can be taken as a subfield of Natural Language Processing, but is primarily concerned with generating linguistic expressions, often from knowledge bases and logic.
This has now largely been taken over by LLMs.