Computational linguistics

Computational linguistics refers to the use of computers for language and text analysis.


Includes tasks like


NLP, NLU, and more

The more mechanical side

NLP has grown very broad.


With a lot of statistical and machine learning techniques around, there is a distinction you can make between

  • more mechanical approaches, which might end up with complete annotations
  • more statistical tasks, which

Tokenization, segmentation, boundaries and splitting


Tokenization refers to dividing free text into tokens. Usually the smallest units you want to use for a particular analysis/computation.

For example, "This, that, and foo." might turn into the tokens "This", ",", "that", ",", "and", "foo", ".".
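
As a minimal sketch of that example (assuming a words-and-punctuation notion of tokens; real tokenizers handle far more cases), a single regular expression goes a long way:

import re

def tokenize(text):
    # Words (optionally with an internal apostrophe), or any single
    # non-space, non-letter character as its own token.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", text)

print(tokenize("This, that, and foo."))
# ['This', ',', 'that', ',', 'and', 'foo', '.']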

You could also tokenize into sentences, lexical units or such, morphemes, or lexemes, though you start running into fuzziness of boundaries, overlaps, and such.


Tokenization can be a useful organizational tool, to avoid your transforming tasks from interfering with each other. Consider that various analyses, tagging processes, transformations, groupings and such (POS tagging, stemming, inflection marking, phrase identification etc.) all want to work on the same token stream, but any changes from one task could interfere with every other task.

Tokenization may help by having tokens be some realization of object/attribute, but it can also be in the way if under-modelled for the tasks at hand, and it is quite easy to create token-based frameworks that are restrictive, prohibitively bothersome, or both. It is generally suggested that data structures be kept as simple as possible, though there is some question as to whether that holds for a given task or in general, something that language processing frameworks in particular have to deal with.


Words

Taking words out of text is perhaps the most basic thing you may want to do.

In many language scripts, word boundaries are marked, many by delimiting with a space and some with things like word-final markers.

There are also scripts whose orthography does not include such marking. For example, reading Chinese involves a mental process of splitting out concept boundaries, which is sometimes fairly ambiguous without the semantics of the context. Japanese does not have spaces either (though writing can assist one's reading by alternating kanji and kana at concept boundaries, where possible).


Sentences

Taking unstructured text or speech, and demarking where individual sentences are.

Often supports other tasks, e.g. for analysis to focus on structure within sentences. Sometimes as a more direct part of the point, as in text summarization.


Depending on a language's orthography and phonological structure, this task is not at all trivial when you want good accuracy on real-world text, even just on well-structured text. Utterances are much harder. In many languages, abbreviations also make things much harder.

Many initial and still-current implementations focus on mechanical, token-based rules, while in reality there are a lot of exception cases that can only be resolved based on content and context. Some can be handled with specific exceptions, but in the long term it is less messy and more accurate to use statistical, collocative analysis.


Notes:

  • sometimes, terms like 'sentence boundary detection' refer more specifically to adding sentence punctuation to punctuation-less text such as telegrams (somewhat comparable to word boundary detection in text and sound, and line break detection in text).


In writing systems that use capitals and punctuation, the simplest approach to text sentence splitting is splitting on periods, question marks and exclamation marks, which may get expanded with conditions such as 'followed by a space and a capital letter', 'not prepended by what looks like a common title or honorific', and such. This method will always be fragile to specific types of writing and certain types of text, so it will never really yield a widely robust method. However, a day or two worth of rule development, ordering, and further tweaking can get you very decent results for corpus analysis needs.
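
A tiny illustration of that dots-and-conditions idea (a sketch only: the abbreviation list is a stand-in and deliberately incomplete, and real text will defeat these rules regularly):

import re

ABBREVS = {"dr.", "mr.", "mrs.", "e.g.", "vs."}  # stand-in list, never complete

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: ., ! or ?, optional closing quote/bracket,
    # whitespace, then a capital letter.
    for match in re.finditer(r'[.!?]["\')\]]*\s+(?=[A-Z])', text):
        last_word = text[start:match.end()].split()[-1].lower()
        if last_word in ABBREVS:
            continue  # looks like a title or abbreviation, not a full stop
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']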

It has been suggested that more robustness can come from adding functional sentence analysis, such as looking for the presence/absence of (main) verbs, and other simple models of what we ourselves do when listening to sentences. Humans too will usually keep recent sentence parts in mind and will go back to verify. For example, in "It was due Friday at 5 p.m. Saturday afternoon would be too late.", it is probably only at 'would' that many people realize that a new sentence probably started at 'Saturday'. Perhaps a good heuristic for whether a split is sensible is looking at the completeness of the resulting sentences, considering verbs and perhaps other things (which doesn't necessarily need parsing or tagging, at least in a simple version).


Note that there is an upper limit on accuracy anyway - there are enough cases where people will not agree on exactly where the splits really are.



Potential complications, particularly to the dots-and-context approach:

  • a period not used as a full-stop, such as various types of
    • titles/honorifics: Dr., Mrs., Dres.
    • acronyms/initialisms with periods, for example There are no F.C.C. issues., are often not sentence boundaries -- but they can be, when they sit at the end of a sentence.
    • similarly marked abbreviations (possibly incorrectly), such as feat., vs., and such.
    • numbers, e.g. The 3.048 meter pole and perhaps You have $10.33., and even The 2. meter pole.
  • quotes:
    • He asked me "where are you headed?" What? There. "What? There!" "What?" "There!" "Wheeee" said I. (with possible complications in the form of spaces, and other minor variations)
    • apostrophes can easily mess up single-quote balancing rules. Consider e.g. 'tis not for thee.
    • fancy quotes, mixed quote use
  • stylistic text (use/abuse)
    • list items often have no period, but are usually (considered to be) sentences
    • headings that continue in the text
    • You're probably designing for English by now (a trainable model may be more practical)
  • unusual syntax, such as bad internet english, l33t, computer code, and others


Interesting things to think about - and possible rules:

  • quoted text (and parenthetical text) can be considered sub-sentences; you may want to model that.
  • sentences cannot start with certain punctuation (period, comma, percent sign, dashes, etc.)
  • Sentence (and sub-sentence) separators usually happen at/near characters like ", - and --
  • A period followed by a digit is often not a sentence boundary
  • A complete set of titles/honorifics is basically never complete - and a pattern like [A-Z][a-zA-Z]*[.] won't always do either.
  • .!? split (sub)sentences, and may generally be followed by " or )
  • 'the next character is a capital' is decent verification for a break at the current position -- but not always; names, proper nouns and nonsense triggers for the verification can mess that up.
  • sentences usually don't end with titles/honorifics (but can)
  • sentences rarely end with initials (but can).
  • sentences sometimes end with abbreviations.
  • balanced parentheses and balanced double-quotes are a matter of style, and may differ (see e.g. conventions in book dialogues) -- you probably want to do basic text analysis to see whether this is a useful constraint or not.


Other methods, that try to be less content-naive, include:

  • looking for patterns (in words and/or their part of speech) to judge whether any particular sentence split is likely correct
  • using sets of rules, heuristics, or probabilities (often token-based)
  • interaction with named entity recognition / name recognition
  • Information from overall text, e.g. positions of paragraphs, indented lines, and such
  • interaction with a POS tagging
  • ...and combinations.


Note that models that consider only direct context (e.g. simple regexps, token n-grams for small n, certain trained models) easily have an upper limit on accuracy, as the same direct context does not always mean identical sentence split validity.


Unsorted related links:


See also

Word boundary detection

Word boundary detection is useful when

http://icu-project.org/docs/papers/text_boundary_analysis_in_java/

Character boundary detection

Character boundary detection is relevant in Unicode, in that you can't split a string into multiple strings just anywhere. Consider letter plus combining-accent pairs, Unicode characters that are drawn differently when combined, Unicode surrogates, and such.



Chunking


Where a fuller parser is understood as giving a more complete derivation of all the structure in a sentence (which is hard to do with good precision), a chunker can be seen as a "do one task (preferably well)" tool.

For example, a chunker may do nothing more than detecting noun groups, e.g. detecting phrases with a good guess as to what their head is, possibly also trying verb groups, verb phrases. Maybe partitioning a sentence into NP, NP, prepositions, other.

And nothing more than that.


Their output is often also simple - nothing recursive, nothing overlapping.


Now that do-it-all parsers are more common, chunking for the "do a select job decently" is less common, but doing a specific job based on a fuller parse (e.g. detecting noun phrases) may be more common, and maybe more precise than classical chunkers because it gets to use more contextual information that the parser found. You can argue whether that is still called chunking.

In short: parsing that provides a partial syntactic structure of a sentence, with limited tree depth, as opposed to full-on parsing.
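
For instance, NLTK ships a regular-expression chunker that does exactly this kind of single job. A minimal NP-chunking sketch (the grammar is a toy, and it assumes the usual NLTK tokenizer/tagger data has been downloaded):

import nltk

# Toy NP rule: optional determiner, any adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox saw a small dog"))
print(chunker.parse(tagged))
# roughly: (S (NP The/DT quick/JJ brown/JJ fox/NN) saw/VBD (NP a/DT small/JJ dog/NN))

The output is one flat layer of chunks over the token stream: nothing recursive, nothing overlapping.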


Named entity recognition



Named-Entity Recognition (NER) considers Named Entities to be any concept for which we use fairly consistent names.

This could be considered two tasks in one: recognizing such a name, and classifying it as a particular type of entity.


And typically, the aim is to specifically avoid a precompiled list (a 'gazetteer'), and try to generalize a little so we can find things that fit a pattern even when we don't know the specific case.


There is no singular definition of the kinds of things this includes, but it often focuses on

  • names of people,
  • names of organizations,
  • names of locations,

and also

  • medical codes and other consistently encoded things,
  • quantities
  • dates
  • times
  • money amounts

...probably in part because those are pretty predictable, though note that none of those last few are really names; they're just useful things to tag.
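
As a usage sketch, a pretrained pipeline such as spaCy's does both the recognition and the classification in one call (assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Paris Hilton flew to Paris on March 3rd and spent about $400.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # labels like PERSON, GPE, DATE, MONEY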


Stemming, lemmatization


Stemming

Lemmatization

Algorithms and software

  • Dawson [1]
    • From 1974. Similar to Lovins; intended as a refinement.
  • PyStemmer[2]
    • Mostly a Python wrapper for Snowball's library.
    • Works on English, German, Norwegian, Italian, Dutch, Portuguese, French, Swedish. (verify)
  • Snowball[3]
    • mostly does suffix stripping.
    • Works on English, French, German, Dutch, Spanish, Portuguese, Italian, Swedish, Norwegian, Danish. (verify)
    • Under BSD license[4].
  • Krovetz [6]
    • From 1993(verify)
    • Accurate inflectional stemmer, but complicated and not very powerful.
    • R. Krovetz (1993) Viewing morphology as an inference process
  • Lovins
    • From 1968
    • See also [7]. [8]
    • See also JB Lovins (1968) Development of a stemming algorithm
  • Lancaster (Paice/Husk) [9]
    • a.k.a. Lancaster (Paice/Husk) stemmer
    • From 1990
    • Fairly fast, and relatively simple. Aggressively removes endings (sometimes more than you would want).
    • See also [10]
  • RSLP stemmer (Removedor de Sufixos da Lingua Portuguesa)[12] - a suffix-stripping stemmer for Portuguese

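
As a usage sketch, the Snowball stemmers listed above are directly available through NLTK:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["running", "cats", "easily"]:
    print(word, "->", stemmer.stem(word))
# running -> run, cats -> cat, easily -> easili
# ('easili' is not a word, which is fine: stems only need to be
# consistent, not readable)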

See also (lemmatization)

  • Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting



Speech segmentation

The act of recognizing word boundaries in speech, and the artificial imitation of doing the same in phonetic data.

Can also refer to finding syllables and phonemes.

Corpus linguistics


Any study of language through (usually large) samples. Now usually refers to computer-based analysis.



Annotation

Annotated data is much easier to learn from, while learning from non-annotated corpora avoids bias to annotation errors.

Manually annotated corpora are typically better (up to the point where human disagreement happens), but are costly in terms of time and/or money to create. Semi-automatic methods are often used but only help to some degree.

Automating accurate annotation is a useful thing, though it is arguably more an application than a field in itself.


Lexical annotation

Probably the most common: adds lexical category to words, according to a particular tagset, and this implies knowing the lemma (...or having such knowledge in the trained data that e.g. statistical parsers are based on).


Phonetic annotation

Records how words were pronounced, at least phonemically, sometimes with more detailed phonetic information and possibly prosody (intonation, stress).


Semantic annotation

Disambiguates word senses. Relatively rare as it is costly to do manually (as any annotation is) and hard to do automatically.


Pragmatic and Discourse/Conversation annotation

(Sociological) discourse/conversation analysis is often served by informed transcription/annotation by someone present or viewing video, since discourse sees a lot of incomplete, ungrammatical sentences that may depend on physical context, pauses, gestures and more - which also crosses into pragmatic annotation.


There is also analysis that tries to extract e.g. reference across sentences (to resolve cross-sentence ellipsis). Systems such as Discourse Representation Theory (DRT) are formal analysis/specification systems.


See also

  • Wikipedia: Corpus linguistics
  • http://www.athel.com/corpus.html - fairly large index of related sites
  • http://www.natcorp.ox.ac.uk/corpora.html
  • http://ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm
  • http://www.comp.lancs.ac.uk/ucrel/annotation.html
  • http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/corpus4/4fra1.htm


Distributional similarity

The observation is that words that occur in the same contexts tend to have similar meanings (Harris, 1954) -- "A word is characterized by the company it keeps" (Firth, 1957).
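
A bare-bones sketch of that observation: represent each word by counts of the words around it, and compare those count vectors, here with cosine similarity (the window size and corpus are toy choices):

from collections import Counter, defaultdict
import math

def context_vectors(tokens, window=2):
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

tokens = "the cat sat on the mat the dog sat on the rug".split()
vectors = context_vectors(tokens)
print(cosine(vectors["cat"], vectors["dog"]))  # fairly high: similar company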


Collocation

A collocation refers to words that occur together more often than basic probability would suggest (often based on word-level statistics).


Collocation analysis studies the adjacency of words. This often easily turns up things like

  • preferences between words that (lexically speaking) might be interchangeable, some of which are much stronger than others (consider the alternatives not chosen. Not necessarily wrong, but certainly marked)
    • verb-preposition pairs, e.g. agree with
    • adjective-noun, e.g. maiden voyage, excruciating pain, spare time,
    • verbs-noun, e.g. commit suicide, make a bed
    • verb-adverb, e.g. prepare feverishly, wave frantically
    • adverb-adjective, e.g. downright amazed, fully aware
    • noun combinations: a surge of anger, a bar of soap, a ceasefire agreement
    • ...and more


See e.g. http://www.ozdic.com/ for examples
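
NLTK has ready-made finders for such above-chance statistics. A sketch using pointwise mutual information (the corpus file name is a stand-in for any token list):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize(open("some_corpus.txt").read().lower())
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # drop very rare pairs; PMI overrates them
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 10))  # the ten bigrams most above chance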

Tasks other than parsing

Hyphenation, syllabization

Syllabization is about spoken sounds - syllables.

Hyphenation is about typesetting - graphemes. It follows readability -- which usually means it follows sounds and/or morphemes. It is also not very strictly defined.


Syllabization

Syllabification (syllabication, sometimes syllabization) breaks a word into groups of constituent sounds.

This isn't always a hard, clear-cut grouping. Ambisyllabicity refers to a sound being the coda of one and the onset of the next syllable. For example, the t in bottle is the onset of the second, but also the coda of the first syllable (it helps that this is a stop).


There are various patterns/observations (some are basically rules), including:

  • every syllable needs a vowel sound
    • amount of syllables == amount of vowel sounds (always?)
  • digraphs are not divided
  • consonant blends are not divided
  • compound words will split their parts
  • presence of x or ck: usually divided after
  • adjacent consonants: often split there
  • vcv: often split after the c

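
A crude sketch of the first rule above, counting vowel-letter groups as a proxy for vowel sounds (silent e and other spelling-vs-sound mismatches will trip it up; 'make' counts as 2, for example):

import re

def count_syllables(word):
    # One syllable per run of consecutive vowel letters.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

for word in ["banana", "syllable", "strength", "reading"]:
    print(word, count_syllables(word))
# banana 3, syllable 3, strength 1, reading 2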

http://english.glendale.cc.ca.us/syllables.html
http://www.createdbyteachers.com/syllablerulescharts.html
http://teacher.scholastic.com/reading/bestpractices/phonics/syllabication.pdf



Hyphenation

Hyphenation (usually) refers to inserting a hyphen (with variation in typography) in a word, for one of many reasons, including...

Typographic hyphenation

One common use appears in typesetting, for layout reasons: hyphens often break long words between two lines, to avoid having the line be wider than the rest of the paragraph or, in full justification, to avoid having to insert very wide spaces between all words.

Since this is primarily about readability (which, yes, is guided by morphemes and pronunciation), it turns out to be mostly about what to avoid, more about acceptability than (singular) correctness.

That leaves plenty of room for you to be lost in purposeless pedantry trying to create singular correctness.

There are many possible rules, plenty of exceptions to those, and a good number of words on which sources disagree.

Such hyphenation is most important in good-looking fully-justified text, and less important in right-ragged text. For the latter, no hyphenation is better than bad hyphenation, though particularly long words cannot be escaped, so some minimal rules are a good idea.

A decent read: http://www.melbpc.org.au/pcupdate/9100/9112article4.htm



Dialect-specific

The decision of where to break a word in typesetting can vary. For example for English, US hyphenation is based on pronunciation so will generally follow syllable edges, while UK hyphenation plays by etymology/morphemes first, sound second. For example: the hyphenation of mechanism is mechan-ism in the UK, and mecha-nism in the US. That of progress is pro-gress in the UK, prog-ress in the US.

For both US and UK English, style guides will add some rules, for example avoiding breaks after just the first character (e.g. o-riginally), and avoid cases where you may feed misleading expectations on what the rest of the word is, for example in coin-cidence or co-inage.


American English:

  • based on sound, both in that it follows stress and tends to follow syllables

British English:

  • based on morphology, then sound

Australian English:




Computational hyphenation

Computational hyphenation has existed for quite a while, for example in anything that needs to do text layout.

Hyphenation can also be helpful to some computational linguistic tasks, such as recognizing similar words (slightly more strictly than with no word analysis at all), partly because break points will be pretty consistent e.g. between different inflections and derivations of the same word.


Hyphenating well is a complex task, because it may involve pronunciation, etymology, and human-readability, and the combination of those is (in many languages) not easy to succinctly convert to a few rules. Such rules may in fact give conflicting suggestions.


It is not as hard to get decent results, though. For example, LaTeX typesetting guesses possible break points by character context, and puts different weights on each. It uses this very simple model to choose the most likely break point, and prefers easy breaks over more questionable breaks (in words and between words, as this is purely about typesetting).

Such hyphenation is rarely actually correct, in that it doesn't necessarily split on syllables, morphemes, or necessarily do clever things with unusual words such as loan words, nor will all possible break points necessarily turn up, but considering its simplicity, this system works quite well.
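
As a usage sketch, the pyphen library wraps the same style of pattern files (assuming it is installed; the exact break points depend on the language patterns you load):

import pyphen

dic = pyphen.Pyphen(lang="en_US")
print(dic.inserted("hyphenation"))  # e.g. hy-phen-ation
print(dic.inserted("mechanism"))    # US patterns, so sound-based breaks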


In this context, a soft hyphen carries two meanings:

  • an invisible character that is a manual hint for automatic hyphenation (a sense largely used in word processors, and also web browsers)
  • a character automatically inserted as part of line breaking. That is, the visual result inserted by the hyphenation process.


See also:

Readability, convention

within words

Certain constructions, often morphological (e.g. affixes, agglutination), can introduce letter sequences that are harder to read.

In English, hyphens are commonly used to help readability. Cases include

  • places where double vowels would be introduced, such as in anti-inflammatory (rather than antiinflammatory)
  • cases where the original word was capitalized, such as anti-Semitic (rather than antiSemitic or antisemitic)

Rules like these exist in many languages, though the reasons and details vary - and may not be quite as strict as some people and books claim.


lexicalized

In English, compounds (of specific types) can (often because of delexicalization) variably appear with a space, a hyphen, or agglutinated (written together). A number of compounds are hyphenated because of their current state in that process.

There are quite a few constructions that are usually hyphenated (in an almost lexical(ized) way), such as expressions that act as compounds, and some uses of morphemes, such as 'self-' as a prefix.


Some of these things happen more in UK English than in US English.

For example, it is not unusual to see "good-bye" in UK English, where US English would write it as "goodbye". In this case, both English variants allow both forms, but have a preference. In other cases the preference may be stronger, and the other form is considered to look weird or incorrect.


The patterns that control these things are influenced by a language's tendency to agglutinate, and various other things.


Compounds

Hyphens can mark cases where multiple words act as a phrase, but could be easily misunderstood or misread.

There are a few wide rules and tendencies, a bunch of specific-case rules, and quite a bit of real-world variation.


Notes

  • UK English does it more than US English
  • rarely done after -ly adverbs or comparatives
  • common when there is ambiguity in compound modifier / modified compounds
    • Ambiguity depends on case and context, and hyphenation is optional when one meaning is a lot more obvious, and the alternative meaning would either lead to other formulations or not occur to us at all. For example, "High school students", while technically ambiguous, is rarely understood as "school students high on drugs", so the hyphen in high-school is optional.
  • noun-verb combinations get this treatment (more than noun-noun)
    • seems much more likely when a modifier appears before the noun, much more rarely if it appears after
  • considered when a phrase leads to a garden path quality within a sentence, e.g. some uses of ill-advised, level-headed
  • older compounds, common compounds, idioms, and anything else collocative may be lexicalized enough to be less ambiguous today, so see less hyphen use. You would probably read "ill advised patient" as "ill-advised patient", which you wouldn't if 'ill-advised' wasn't a thing.


Lexical hyphens

Heteronyms in different lexical categories may have different hyphenation (and syllabization).

Consider 'invalid' as a noun (IN-va-lid) and adjective (in-VAL-id), largely due to lexical stress.

See also (hyphenation)

Language detection

Identifying the language of a given piece of text can be handy for various language-specific computational tasks, including text classification, sentence splitting, search fuzzing, and more.

In some ways just any classification problem.


Your method may need to focus on one or more of, and might choose to use all of:

  • languages
  • writing systems (also e.g. romanizations and such), arguably some real-world resolution you can paste on top of mechanical parts accurately enough
  • text encodings (you can normalize this by detecting the text encoding and using only unicode; note that character set may be a strong indication of the stored language)

...and this should probably be a choice, not a side effect of your method.

A proper system probably can use all such information, but is not distracted by it. You could e.g. detect text encoding (see e.g. http://chardet.feedparser.org/) and use it to give a mild boost to the languages that encoding is usually used for, and have your actual detection mechanism work on unicode characters to base it on actual human writing and not computer coding.


Lexical methods (recognizing words)

A very simple (e.g. bayesian-style) model where word matches are basically votes for respective languages.

Surprisingly accurate if your to-be-identified text is more than a few words, and you have a relatively complete vocabulary for each of your target languages.
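
A minimal vote-counting sketch (the vocabularies are tiny stand-ins; real ones would hold thousands of words per language):

VOCAB = {
    "en": {"the", "and", "of", "to", "is", "you"},
    "nl": {"de", "het", "een", "van", "is", "je"},
    "de": {"der", "die", "das", "und", "ist", "du"},
}

def guess_language(text):
    words = text.lower().split()
    votes = {lang: sum(w in vocab for w in words)
             for lang, vocab in VOCAB.items()}
    return max(votes, key=votes.get), votes

print(guess_language("the cat is of the opinion that you are wrong"))
# ('en', {'en': 5, 'nl': 1, 'de': 0})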


It may be useful to make this more involved, e.g. do word n-grams to model probabilities of words likely to follow others, though there is some chicken-and-egg issue in the lack of normalization.

The larger the text, the likelier that a bunch of words will match even limited lexical resources (though this can vary with how synthetic the language is), although it can be fragile with wordlist length and means of matching.

For robustness on smaller texts you want to have a large enough vocabulary for each language, and what constitutes 'enough' can vary with the language, for example with how much affixing and synthesis happens in that language.

Note that you may still want to use probabilities and/or have a lower limit on the amount of indicators required, because things like adopted words and jargon and such may throw off the guess to a historically related or an auxiliary language.


character n-gram

A mechanically simple-enough approach is a statistical n-gram setup.


There are a number of varying implementations of the n-gram idea, based on the observation that alphabet/character usage and a language's morphology make for character sequences that point to particular languages.

It is also limited. Consider that related languages share a common origin, and are likely to share both roots and some morphology, particularly languages with a Latin or Romance background, and differences in spelling and morphology may not stand out very much from the things shared.

Even so, n-gram methods can work well enough when trained on just a few example documents per language and when asked to classify n-grams from half a dozen words or more.
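
A sketch of the rank-order profile idea from the Cavnar & Trenkle paper listed below: keep each language's most frequent character n-grams in rank order, and score a document by how far each of its n-grams is 'out of place' (the training texts here are toy-sized):

from collections import Counter

def profile(text, n=3, size=300):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    rank = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)  # maximum distance for unseen n-grams
    return sum(abs(r - rank.get(g, penalty)) for r, g in enumerate(doc_profile))

training = {
    "en": "the quick brown fox jumps over the lazy dog near the other dogs",
    "nl": "de snelle bruine vos springt over de luie hond bij de andere honden",
}
profiles = {lang: profile(text) for lang, text in training.items()}

snippet = "the dog jumps over the fox"
print(min(profiles, key=lambda l: out_of_place(profile(snippet), profiles[l])))
# en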



See also (language detection)


  • Beesley (1988) Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text [14]
  • WB Cavnar, JM Trenkle (1994) N-Gram-Based Text Categorization [15]
  • A Poutsma (2001) Applying Monte Carlo Techniques to Language Identification [16]


  • P Sibun, JC Reynar (1996) Language Identification: Examining the Issues [17]
  • G Grefenstette (1995) Comparing Two Language Identification Schemes [18]
  • Hughes et al. (2006) Reconsidering Language Identification for Written Language Resources [19]
  • EM Gold (1967) Language identification in the limit [20]



Spell checking

Entity Linking

Entity Linking means assigning a unique identity to detected entities.

Wikipedia's example of "Paris is the capital of France" points out that

  • while NER can be fairly sure that Paris is an entity at all,
  • only EL commits to trying to figure out that Paris is probably
    • a city (and not e.g. a person)
    • and probably a specific city (among multiple called Paris) (it probably returns some reference to that specific city)

This may then go on to try to extract more formal ontology, and as such can also be useful in assisting annotation.
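
A toy sketch of the disambiguation step (the knowledge base, ids, and descriptions are invented stand-ins): pick the candidate whose description shares the most words with the sentence:

# Invented mini knowledge base: surface form -> candidate entities.
KB = {
    "Paris": [
        {"id": "city-paris-fr", "description": "capital city of France"},
        {"id": "city-paris-tx", "description": "city in Texas, United States"},
        {"id": "person-hilton", "description": "American media personality"},
    ],
}

def link(mention, sentence):
    context = set(sentence.lower().split())
    return max(KB[mention],
               key=lambda c: len(context & set(c["description"].lower().split())))

print(link("Paris", "Paris is the capital of France")["id"])  # city-paris-fr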


https://en.wikipedia.org/wiki/Entity_linking

Sentiment analysis

In theory, we study affective states. In practice we mostly estimate where phrasing seems to sit on a scale between polar opposites like positive/negative (or sometimes just those opposites) -- because we only have some words to go on.


This is usually specifically applied to things like customer reviews, seeing how happy they are about it, and what we can tie that to.

There are other ideas, like trying to judge how subjective or objective statements are.


Even a (fuzzy) classifier can help, and sentiment analysis sometimes amounts to little more than a soft classifier with two outcomes or, more commonly, a value on that range.
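
A minimal sketch of such a soft classifier (the word lists are stand-ins, and negation, sarcasm, and implicit statements all defeat it):

POSITIVE = {"good", "great", "happy", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "sad", "awful", "hate"}

def polarity(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / max(1, len(words))  # a value on a negative..positive range

print(polarity("great product and I love it"))     # > 0
print(polarity("terrible battery and I hate it"))  # < 0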


Word sense disambiguation

Assigning a specific sense (from several related ones) to a polysemous word, often via context.
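
As a usage sketch, NLTK ships the classic Lesk algorithm, which picks the sense whose dictionary gloss overlaps most with the context (assuming the wordnet data is downloaded; Lesk is a weak baseline, not the state of the art):

from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank", "n")
print(sense, "-", sense.definition())  # the synset Lesk picks, with its gloss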


Semantic role labeling; Relation extraction

Coreference resolution



A referent is a person or thing you can refer to by name.


Words are co-referential if multiple words refer to the same thing - the same referent. For example, in 'John had his dog with him', 'John' and 'him' are co-referential.

This is common in most natural language, where more than one clause/sentence is talking about the same thing, and repeating the full word feels quite redundant (unless used for rhetorical anaphora - repetition for emphasis - which confusingly is a different concept from linguistic anaphora which is closely related to references) (potentially to the point of semantic satiation).


When parsing meaning, it is often useful to find (co)referents, particularly since pronouns (and other pro-form constructions) are otherwise empty.


It might have been called reference resolution, but seems to be called coreference resolution because typically both words are present in the sentence.

(If not, e.g. It in It's raining or a vague unspecified they in "they say", there is usually little to resolve).


Coreference resolution is often eased by words marked for grammatical features like number, gender, case, tense, person, and such, and by tendencies such as that we tend to refer only to relatively recently mentioned concepts.
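
A naive sketch of how such cues combine, using recency plus simple agreement (real systems use far richer features; the mention records would come from parsing/NER):

mentions = [
    {"text": "Mary",     "gender": "f", "number": "sg"},
    {"text": "the dogs", "gender": "n", "number": "pl"},
]

PRONOUNS = {"she": ("f", "sg"), "he": ("m", "sg"), "they": (None, "pl")}

def resolve(pronoun, candidates):
    gender, number = PRONOUNS[pronoun]
    # Walk backwards: prefer the most recently mentioned compatible referent.
    for cand in reversed(candidates):
        if cand["number"] == number and gender in (None, cand["gender"]):
            return cand["text"]
    return None

print(resolve("she", mentions))   # Mary
print(resolve("they", mentions))  # the dogs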


http://nlpprogress.com/english/coreference_resolution.html

https://nlp.stanford.edu/projects/coref.shtml

https://en.wikipedia.org/wiki/Coreference

Text summarization


Text summarization usually refers to the (automated) process of taking a text and producing anything from a subject to a naturally formatted summary.


Basic implementations do only sentence extraction, then simplify the whole by choosing which sentences are more interesting to keep, which may be as simple as some keyword-based weighing.
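
A sketch of that keyword-based weighing: score each sentence by the average corpus frequency of its content words, and keep the top few in their original order (the stopword list is a stand-in):

import re
from collections import Counter

STOP = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "was", "that"}

def content_words(s):
    return [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP]

def summarize(text, keep=2):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freq = Counter(content_words(text))
    def score(s):
        words = content_words(s)
        return sum(freq[w] for w in words) / max(1, len(words))
    best = sorted(sentences, key=score, reverse=True)[:keep]
    return " ".join(sorted(best, key=sentences.index))  # restore original order

print(summarize("Cats sleep a lot. Cats eat fish. The weather was dull.", keep=2))
# Cats sleep a lot. Cats eat fish.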


More complex implementations may do deeper parsing, and/or generate new text based on some semantically representative intermediate.

Entailment


Given two text fragments, (textual) entailment tries to answer whether (most people would agree that) the meaning of one can be inferred (is entailed) from the other.


Usually one of the two fragments is a short, simple hypothesis construction. To steal some examples:

  • Fragment: "Sir Ian Blair, the Metropolitan Police Commissioner, said, last night, that his officers were 'playing out of their socks', but admitted that they were 'racing against time' to track down the bombers."
    Hypothesis: "Sir Ian Blair works for the Metropolitan Police."
  • Fragment: "Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year"
    Hypothesis: "Yahoo acquired Overture" (based on the question "Who acquired Overture" and the fragment)

Hypotheses might well be based on the fragment itself (e.g. in information extraction), or might be matched with the fragment (e.g. in question answering). Many of the papers on entailment argue about what does and what does not constitute entailment.


Matching is often relatively basic, with little consideration of modifiers, quantifiers, conjunctions, complex syntactical structures, complex semantical structures, or pragmatics.
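
A sketch of how basic such matching can be: pure word overlap between hypothesis and text, roughly the kind of baseline the 'relatively basic' remark describes (the threshold is an invented knob):

def entails(text, hypothesis, threshold=0.8):
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    # Fraction of hypothesis words that the text covers.
    return len(h & t) / len(h) >= threshold

print(entails("Yahoo took over search company Overture Services Inc last year",
              "Yahoo took over Overture"))  # True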


Considered a fairly generic NLP task, potentially useful to various others.

Examples of entailment:



See also:


Question answering

Machine translation


Machine translation mostly concerns itself with transfer of lexical and semantic information between sentences in different languages.

Semantic and pragmatic transfer is ideal, but most current models are not advanced enough to deal with those (though they may of course do so in accidental easy/best cases).

Natural language generation


Natural language generation (NLG) can be taken as a subfield of Natural Language Processing, but is primarily concerned with generating linguistic expressions, often from knowledge bases and logic.

This has now been taken over by LLMs.


http://en.wikipedia.org/wiki/Natural_language_generation