Figures of speech, expressions, phraseology, etc.





In linguistics, '''idiomaticity''', '''idiomaticness''', or just '''idiom''' can refer to the concept of
"among all the possible realizations, this is the one the language ended on" - the least [[marked]] way of expressing something.
This is sometimes a fixed sequence, but studying the mildly flexible patterns can also be important, e.g. to language learning.
'''Idiomatic''' may more specifically point at
the most preferred realization and/or patterns - whatever makes you ''not'' evoke a "that's a weird way to say that" or "that's not how that typically works in a sentence".


For example, consider that a lot of common verbs have a preferred [[adposition]]al [[particle]], e.g. work ''on''.


https://en.wikipedia.org/wiki/Idiom_(language_structure)




Outside of linguistics, an idiom tends to more specifically refer to a figurative, non-literal phrase used to express an idea,
which easily includes [[figures of speech]] and other such figurative language.
This can be considered just one everyday case of the wider concept.


https://en.wikipedia.org/wiki/Idiom




'''Collocations''' are statistically idiosyncratic sequences: series of words that occur together more often than chance would suggest.
 
<!--
This isn't about comparing texts. It's about the assumption that at any point you might choose any word, yet in practice people are more likely to choose some.
-->
 
This covers anything from words used together out of historical/idiosyncratic habit, to MWEs, to grammatical patterns, to very specific idioms - roughly, 'any set of words better treated as a single token'.


Put another way, of all the possible sequences of words, it shows the sequences that come up more often - say, "pretend to", "as a matter of fact", "leaves all parties", "downright amazed", "good news".
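That comparison - observed pair frequency against what the individual word frequencies alone would predict - can be sketched with a toy counter. This is a minimal illustration on a made-up snippet, not a serious extraction method:

```python
# Minimal sketch: find word pairs that occur together more often than
# their individual frequencies would suggest, on a tiny toy corpus.
from collections import Counter

text = ("as a matter of fact the good news is that "
        "the good news travels fast and as a matter of fact "
        "we pretend to like good news").split()

unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))
total = len(text)

def ratio(w1, w2):
    # observed pair probability / probability expected under independence
    p_pair = bigrams[(w1, w2)] / (total - 1)
    p_indep = (unigrams[w1] / total) * (unigrams[w2] / total)
    return p_pair / p_indep

# pairs that appear together more often than chance get a ratio well above 1
for (w1, w2), n in bigrams.most_common():
    if n > 1:
        print(w1, w2, round(ratio(w1, w2), 1))
```

On real corpora the interesting part is the scoring (see the PMI discussion below); this plain ratio is its simplest form.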




People have slightly varied focus.


Collocations are useful in language learning/teaching, which may
* point at some grammatical idiosyncrasies, like which prepositions tend to sit next to which verbs, and which verbs tend to be how you ''do'' specific nouns (see examples below), as this can matter to the correctness of sentences
* point out that a lot of collocations are not compositional, so when some adjective-noun combination doesn't seem to make direct sense (e.g. bright idea), you can assume it's some sort of expression you should look up
* overlap strongly with technical terms and jargon - things that carry strong meaning but are not compositional
* ...see e.g. http://www.ozdic.com/ for examples
Linguistics sometimes focuses on preferences between words that, lexically speaking, might be interchangeable, but practically show preferences (which you can see as not being purely [[compositional]]).
Consider:
* [[adjective]]-[[noun]], a common or preferred adjective used to make a noun stronger or more specific, e.g. maiden voyage, excruciating pain, bright idea, spare time, broad daylight, stiff breeze, strong tea. Alternative adjectives would typically be ''understood'', but be considered somewhat unusual
* [[adverb]]-[[adjective]], e.g. downright amazed, fully aware
* [[verb]]-[[adverb]],  e.g. prepare feverishly, wave frantically
* [[verb]]-[[preposition]] pairs, e.g. agree with
* [[verb]]-[[noun]], e.g. we make rather than take a decision, we make rather than tidy a bed
* [[noun]] combinations: a ''surge'' of ''anger'', a ''bar'' of ''soap'', a ''ceasefire'' ''agreement''
* ...and more
In a more applied sense:
* Collocations matter to translation.
: They make translation harder, in that word-for-word translation will be wrong (not being compositional),
: but at the same time may make it ''easier'' when you detect them (or happen to, as statistical translation might), as their meaning is more singular -
: collocation analysis pointing out that a sequence is idiosyncratic helps focus on learning what it might correspond to.


* natural language ''generation'' would like to know these preferred combinations
* some uses reveal cultural attitudes, e.g. which adjectives we use for the behaviour of specific groups




* it might bring up other patterns useful to natural language parsing - e.g. we agree with someone, we agree on something


If you see 'collocation analysis' mentioned as a method near some math, it is the statistics that helps reveal them,
: including the assumptions that method makes (e.g. how the baseline expectation is defined), and filtering to avoid cases you don't care about,
: and not just the human-curated-and-categorized cases.






===Collocation analysis===
<!--
When you see the word 'collocation' ''and'' some math,
you're often looking at statistics that tries to find word sequences that seem unusually common - idiosyncratic wording, and often phrases -
and not the human-curated-and-categorized cases of interesting things.


The math usually asks "do these words occur together more often than the occurrence of each individually would suggest?",
which implies you have previously calculated the probability of each word.

The choice of reference also determines the kind of focus that the results will have.
If that reference is a larger analysis of that language,
you are likelier to find any and all idiosyncratic words.
If that reference comes from the dataset that the document you're looking at belongs to,
you are likelier to notice what is idiosyncratic within that domain.


There is no singular method - the simplest variants are pretty noisy;
adding filtering and assumptions is cleaner but ''may'' remove some interesting things.
-->
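As a toy illustration of how the choice of reference changes what stands out - all counts here are made up, for a hypothetical 'legal' document scored against two different baselines:

```python
# Toy sketch: the choice of reference corpus decides what looks 'unusual'.
# (All counts are made up for illustration.)
from collections import Counter

doc = Counter({"the": 40, "court": 12, "hereby": 8, "cat": 1})
doc_total = sum(doc.values())

# reference A: general language; reference B: legal language (both hypothetical)
general = Counter({"the": 50000, "court": 200, "hereby": 20, "cat": 500})
legal   = Counter({"the": 50000, "court": 9000, "hereby": 6000, "cat": 5})

def surprise(word, reference):
    # how much more frequent the word is in the document than in the reference
    ref_total = sum(reference.values())
    return (doc[word] / doc_total) / (reference[word] / ref_total)

# against general text the legalese stands out; against legal text it looks ordinary
print(surprise("hereby", general))
print(surprise("hereby", legal))
```

The same document, two baselines, two very different "most idiosyncratic" lists - which is the point made above.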


====The reference probabilities====
<!--
Asking "do these words occur together more often than the occurrence of each individually would suggest?"
implies you already have probabilities for each word.

While this isn't about comparing texts,
you ''do'' have to have a baseline of how likely each word is (comparatively).

For example, compare:
: if you train it on legal text and then run it on that legal text,
: if you train it on general text and then run it on that legal text
...chances are the output is ''similar'', but the first may show fewer formulaic phrases.
Not because the ''formulaic sequences'' are learned,
but just because those formulaic sequences are common and increase the counts for their constituent words.

(For similar reasons, the larger your document set is, the more you can even get away with training on the data itself)


Given a good training set, we can usually deal with unseen documents well,
though there is a small question of what to do with words that we haven't seen before.
Generally we assume those are rare, so assign them a probability near the bottom of what we ''do'' have.
-->


====Pointwise Mutual Information====
<!--
Pointwise Mutual Information (PMI)
quantifies coincidence from the joint distribution and the individual distributions,
assuming independence.

A little more concretely, PMI calculates the probability of one word following another, divided by the probability of each appearing at all:
 PMI(word1,word2) = log( P(word1,word2) / ( P(word1) * P(word2) ) )

Notes:
* the specific log base isn't important when this is a unitless value used only for relative ranking

* 'Pointwise' mostly just points out that we are looking at specific cases at a time. [https://en.wikipedia.org/wiki/Mutual_information Mutual Information] without the 'pointwise' is used on variables/distributions (e.g. based on ''all possible events'')


PMI has a number of limitations:
* at least the classical form puts too much value on low-frequency events
: that is, when individual words don't occur a lot, collocations with them will have a small denominator and are easily overvalued.
:: this not only overvalues much more specialized relations over general ones, it tends to lift out spelling mistakes as the strongest relations, while putting actual phrases further down
::: This is not technically ''wrong'' (information-theoretically, common things are less informative), it just highlights that basic PMI isn't ''quite'' what we want in most practical cases
:: can be reduced by raising the numerator to some power (see PMI<sup>k</sup>), some global correction, or e.g. smoothing with some extra counts

* if comparing unseen text, it will not report anything involving unknown words, no matter how common
: which may be specifically interesting

* the actual value PMI gives is not bounded, or particularly linear, which makes it hard to interpret

* values can be negative, which is hard to interpret
: in a direct sense this means things occur ''less'' than you would expect by chance. This is ''sometimes'' meaningful{{verify}}, but more often comes from too little evidence in the training data.
:: PPMI works around this with output = max(0, output)

* values can be -inf, for things that never occur together, which seems a bit excessive
:: PPMI and NPMI won't be


Variants:

'''Positive Pointwise Mutual Information (PPMI)'''
: Since negative PMI mostly means very little co-occurrence, it is roughly valid to just ignore those values - consider them to be zero.


'''PMI<sup>k</sup>''' (Daille 1994)
 PMI<sup>k</sup>(word1,word2) = log( P(word1,word2)<sup>k</sup> / ( P(word1) * P(word2) ) )
: k=1 is classical PMI
: k=2 or 3 seem suggested
: intended to lessen the bias of classical PMI towards low-frequency events, by boosting the scores of frequent pairs


'''Normalized Pointwise Mutual Information (NPMI)''' (Bouma 2009)
 NPMI(a, b) = PMI(a, b) / −log p(a, b)
: is bounded to -1..1, where
:: -1 for things that never occur together
:: 0 means independence
:: 1 for complete co-occurrence


Notes:
* PMI<sup>k</sup>(a, b) can also be expressed as PMI(a, b) − (−(k − 1) log p(a, b))
: which helps compare it with NPMI
* for some comparison, see Role & Nadif 2011
-->
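The scoring variants above can be sketched directly from probabilities. The log base is chosen arbitrarily, since only relative ranking matters:

```python
# Sketch of PMI and the variants discussed above, computed from probabilities.
# p_a, p_b are unigram probabilities; p_ab the pair's joint probability.
import math

def pmi(p_a, p_b, p_ab):
    return math.log2(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

def ppmi(p_a, p_b, p_ab):
    # negative PMI usually means too little evidence; clamp it to zero
    return max(0.0, pmi(p_a, p_b, p_ab))

def pmi_k(p_a, p_b, p_ab, k=2):
    # PMI^k (Daille 1994): raising the numerator boosts frequent pairs,
    # lessening classical PMI's bias toward low-frequency events
    return math.log2(p_ab**k / (p_a * p_b)) if p_ab > 0 else float("-inf")

def npmi(p_a, p_b, p_ab):
    # normalized to -1..1: -1 never together, 0 independent, 1 always together
    if p_ab == 0:
        return -1.0
    return pmi(p_a, p_b, p_ab) / -math.log2(p_ab)

print(pmi(0.1, 0.1, 0.01))     # independence: 0
print(npmi(0.01, 0.01, 0.01))  # complete co-occurrence: 1
```

Note how the NPMI of a pair that always co-occurs is exactly 1 regardless of how rare it is, which is what makes it easier to interpret than raw PMI.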


 
See also:
* https://en.wikipedia.org/wiki/Pointwise_mutual_information
* K Church, P Hanks (1990) "{{search|Word Association Norms, Mutual Information, and Lexicography}}"
* B Daille (1994) "{{search|Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. Thèse de Doctorat en Informatique Fondamentale}}"
* C Manning & H Schütze (1999) "{{search|Foundations of Statistical Natural Language Processing}}"
* G Bouma (2009) "{{search|Normalized (pointwise) mutual information in collocation extraction}}"
* W Croft et al. (2010) "{{search|Search engines: information retrieval in practice}}"
* F Role, M Nadif (2011) "{{search|Handling the Impact of Low Frequency Events on Co-occurrence based Measures of Word Similarity - A Case Study of Pointwise Mutual Information}}"
* T Mikolov et al. (2013) "{{search|Distributed Representations of Words and Phrases and their Compositionality}}"


====Filters and assumptions====
 
 


<!--
Some things it does well out of the box. Say, consider a letter repeating a full name: we have a probability for the words that make up the parts, but repeating that sequence is statistically unusual.

Similarly, [[idiomatic]] preferences (and MWEs and other combinations out of historical/idiosyncratic habit)
tend to roll out fairly well,
because the least-marked sequence among alternatives ''implies'' seeing specific words together more often.


Yet the simplest methods may output a good portion of sequences like
"that for the" - where yes, we understand that appears more often, but it's not ''interesting''.

Maybe we are primarily interested in, say, the kinds of terms this book uses that other books do not.
Perhaps just the noun phrases.
Maybe you want to be able to count those with and/or without the adjectives stuck on the front of them.

That all suggests that we do some POS analysis of the text we are dealing with,
and remove everything not matching a particular pattern.
That ''is'' going to give you cleaner output.
...just be sure that everything this removes isn't interesting to you.
Which is hard.


Code may make some further choices about analysis.
For example:
* NLTK's co-occurrence can count words as co-occurring if they appear together within some amount of distance (window)
: because that increases the amount of counts, scoring has to account for it
* some may use [[skip-grams]] to similar effect
* NLTK actually has a number of different metrics it can use to score; PMI is just one of them
* gensim.phrases allows connector words (e.g. [[articles]] a, an, the, [[prepositions]] for, of, with, without, from, to, in, on, by, and [[conjunctions]] and, or), which
** will not start or end a phrase
** allow phrases to contain any amount of them (e.g. eye_of_the_beholder, rock_and_roll) without affecting the scoring
* there are explosively many combinations of words
** you may well want to remove stopwords - they will dominate your statistics - except consider connector-word constructions
** gensim keeps a top list (40 million items by default, see max_vocab_size) to keep RAM use roughly constant and predictable


'''N-gram PMI'''

https://python.plainenglish.io/collocation-discovery-with-pmi-3bde8f351833
-->
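Window-based co-occurrence counting, of the kind NLTK-style scorers use, can be sketched as follows - a simplified illustration, not NLTK's actual implementation:

```python
# Sketch of window-based co-occurrence: words count as co-occurring
# when the second appears within `window` tokens after the first.
from collections import Counter

def cooccurrences(tokens, window=3):
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs[(w, tokens[j])] += 1
    return pairs

tokens = "rock and roll is rock and roll".split()
pairs = cooccurrences(tokens, window=3)
print(pairs[("rock", "roll")])  # counted even across the connector 'and'
```

Widening the window catches pairs separated by connector words (as in rock and roll), at the cost of more counts overall - which is why the scoring then has to account for the window size.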
=====Hidden pragmatism=====
<!--
Articles and systems that implement collocation are often clear about the final scoring,
but may vary in all the steps before this and call it pragmatism.
Some give much cleaner results than others, for a handful of different reasons.
We would like to know what these are, at least roughly.
-->

======What do you compare?======
<!--
While you could do such a calculation within a single document,
this is typically too sparse a source of words, let alone word combinations,
so the result of this analysis on different documents can often be described as 'distracted by different things'.

We tend to train on a large corpus, then look at documents.
Possibly all documents of that same corpus, though a lot of real-world systems will also care to say
something useful about unseen records.
-->


======Dealing with sparsity, and avoiding an explosion of data======
<!--
Just counting how often words occur, compared to just the counts of other pairs, won't get you very far.

The math is somewhat against us here.

* the set of all possible word combinations is by nature an incredibly large set
:: say, if your language has 50000 possible words, then there are 2.5 billion 2-grams, and 125 trillion 3-grams

* the combinations that we will actually find in text are incredibly sparse
:: if you were e.g. to make a 50000-by-50000 table for 2-grams, most cells would be 0

* because words in languages have [[Power_law|Zipfian tendencies]], most of the combinations we record will involve one or more semantically-empty function words from the top of that list (the, of, and, be, to, a, in, that)

* if you ignore very common words in a stopword-like way, that would also be (fairly arbitrarily) removing the ability to deal with phrases that involve them
: which is a reasonable amount of them. Consider e.g. rock and roll, through the grapevine, etc.

* we overvalue things involving words that are rare
: you can clean up a lot of results by saying "ignore anything that involves very-rare unigrams"
: ...but whether that is useful, or removes what you are looking for, depends a lot on what you are doing this analysis for.


Many methods will try to correct for how (un)informative words are,
for example by comparing ''combined'' appearance against ''expected'' appearance.
We look to things like the [[log likelihood ratio]] and Pointwise Mutual Information (PMI), which are similar ideas (also related to entropy, and (naive) Bayes).

It varies exactly how that estimation works, which assumptions you make, how thorough the model is (many do ''not'' go as far as distribution estimation or [[smoothing sparse data|smoothing]] the inherently sparse data), and which preferences you build in.

One major detail is that various mathematical approaches (e.g. plain 'chance of combination divided by chance of appearing individually') will overvalue the rare - including tokens that are rare for any reason, not least because they are misspelled - while all actual phrases end up further down the list.

And that makes some sense, in that we are looking for unusual things,
yet we are typically looking for ''patterns'' that are less usual,
which almost by definition lie somewhere ''between'' completely regular and extremely unusual.

It may be useful to have a 'how unusual' parameter to the model,
because it also lets you tune it between 'specialized jargon' levels and 'general associations' levels.


'''Sliding windows and skip-grams'''

For a text of k words we may see at most k unique 2-grams,
yet they come from a space of up to n<sup>2</sup> possible 2-grams.

This is an explosive amount of data we ''potentially'' see.
Assuming we have 50000 words, that's 2.5 billion combinations, of which we will see ''most'' combinations 0 or 1 times even in very large datasets.

So we tend to count this sparsely. {{comment|(We don't have an n-by-n matrix, we have a hashmap with entries. Though note this is more about "keeping it in a single PC's RAM to keep access times reasonable" than a hard limitation on what you couldn't or shouldn't do.)}}

When you introduce [[skip-grams]], the sparsity issue gets better, at the cost of a lot more entries.

When you introduce n-grams for n&ge;3, to find longer collocations, you also get a lot more entries.

Given a large enough dataset we may ''tend'' towards a very similar large amount of unique n-grams
(because we will eventually see all reasonably possible combinations).


'''Filtering input more'''

If you have an endless source of text (say, the internet),
then you could choose to filter in only evidence that is relatively clear,
particularly if you're doing things like skip-grams.

You could enter skip-gram evidence ''only'' if there is more than one occurrence in a document,
the assumption being that one occurrence may be noise, and if it's a real thing,
other documents will use it more and enter it for us.

We could keep a long-term vocabulary based on such ''filtered'' evidence,
so that multiple passes over the same dataset will avoid the bulk of low-evidence n-grams.

This is still bias, yes; a single document mentioning it twice would weigh more than
a thousand mentioning it once, but these are ''reasonable'' assumptions in most cases.

(You might be tempted to filter out the top as well, but it would barely make a difference - they will be the majority of ''counts'', yes, but just a small amount of the entries you're keeping.)


https://github.com/rtapiaoregui/collocater
: the varied data file seems to suggest mainly detecting known ones?

https://radimrehurek.com/gensim/models/phrases.html

https://pitt.libguides.com/textmining/collocation

https://python.plainenglish.io/collocation-discovery-with-pmi-3bde8f351833
-->
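The sparse-hashmap counting with a minimum-evidence filter can be sketched as follows (a toy illustration; real systems would also cap vocabulary size and handle far more data):

```python
# Sketch of sparse counting: a hashmap of observed n-grams instead of a
# dense 50000-by-50000 table, plus a minimum-evidence filter.
from collections import Counter

def count_bigrams(docs, min_count=2):
    seen = Counter()  # only observed bigrams take up space
    for doc in docs:
        toks = doc.split()
        seen.update(zip(toks, toks[1:]))
    # drop low-evidence entries; most possible bigrams never appear at all
    return {bg: n for bg, n in seen.items() if n >= min_count}

docs = ["through the grapevine",
        "heard it through the grapevine",
        "the cat sat"]
kept = count_bigrams(docs)
print(kept)  # only bigrams seen at least twice survive
```

The `min_count` threshold is the simplest form of the low-evidence filtering described above; it trades away genuinely rare phrases for a much smaller table.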

Revision as of 13:31, 28 June 2024




==Figures of speech==

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

A figure of speech is any use of a word/phrase where the intended meaning deviates from the literal meaning -- anything that is more figurative than purely literal.


It's not much of a stretch to go beyond word choice and include any sort of induced non-literal meaning used in arguments, in literature, in cinema, in politics, and in other kinds of storytelling.


They seem constrained only by creativity and other people's ability to understand the result. Since we do this a lot, there's a large amount of things that fit the description, and we have split it up into many more specific things to discuss - so we have a bunch more words in the area, some overlapping with each other, and some from semantics and pragmatics that happen to be quite relevant. Such lists reveal that this is for a large part related to our habit of rhetoric.



For example:

* allusion - a casual/brief reference (explicit or implicit) to another person, place, event, etc.

* meiosis - intentional understatement,
: e.g. 'the pond' to refer to the Atlantic Ocean, 'the troubles' for the Northern Irish conflict

* litotes - understatement that uses a negation to express a positive, e.g. using "not bad" to mean pretty good.
: The actual meaning can depend on context; e.g. 'not bad' could have any literal meaning from 'not entirely horrible as such' to 'excellent'.

* oxymoron - conjunction of words with intentionally contradictory meaning (see also contradiction in terms, paradox),
: e.g. act naturally, old news, minor crisis, oxymoron (roughly means sharp-dull).
: Sometimes less intentional, e.g. original copy.
: Civil war would apply, except you can argue both that it's a calque (loan translation) just pointing out it's between civilians, and/or that it's equivocating civil in the sense of polite and in the sense of groups within the same state.

* irony - intentional implication of the opposite of the standard meaning {{verify}}

* metaphor - implied comparative description that implies some sort of similarity,
: usually by equating things with no direct relation.
: Often used to economically imply certain properties.
: Similar to but distinct from simile, which is an explicit comparison.
* allegory - sustained metaphor, usually tying in various metaphors related to an initial one
* parable - anecdotal extended metaphor intending to make an (often didactic or moral) point
* catachresis - a mix of more than one metaphor (by design or not) {{verify}}

* tropes - less-literal references often understood as a replacement, e.g. in rhetoric and storytelling;
: when approached as "what we do in storytelling", many of the above apply, particularly the ones that play on meaning, twist meaning, or lead to contrasted interpretations

* hyperbole - exaggeration meant to be used as stress
** auxesis - hyperbole relying on word choice, e.g. 'tome' for a book, 'laceration' for a scratch
** adynaton - extreme hyperbole, suggesting impossibility

* metonymy and synecdoche - reference to a proximate object, often metaphorical, for example:
: 'the law' to refer to the police,
: 'hired hands',
: 'bricks and mortar',
: 'bread' for food in general,
: equating a university's actions with its board




===Tropes===

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

More of a device in literature and rhetoric (than in linguistics directly), tropes are rhetorical figures of speech understood specifically as a replacement with a less-literal meaning.

Many also rely on a play, twist, or approximation of words or meaning, and on contrasts - which makes them most associated with rhetoric, storytelling, and cinema, where there is specific focus on how concepts are conveyed.

In particular, we often imply concepts from patterns we recognize, without having to spell them out, and often use layers of contextual meaning.


For example, in writing and speaking, tropes are often employed for the more colorful result that is more interesting to read or listen to, and is often explained as a part of rhetoric.


In particular, visual storytelling has its own conventions, as it can add visual metaphor, more easily hide details, and rely on consistent symbolism, whether or not it makes literal sense. [7][8]


Metonymy, Synecdoche

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Metonymy (meaning something like 'change of name') is using one name/entity for another thing.

Related to meronym/holonym, but often with a specific part that is also representative of the whole.

For example:

  • The Crown to refer to the British monarchy; similarly, Washington is sometimes used to refer to the United States government
  • "Nixon bombed Hanoi" ('Nixon' referring to the armed forces controlled by Nixon)
  • "A dish" referring to a part of a meal
  • "The car rear-ended me" ('me' referring to the car that the speaker was driving)
  • "Bread and circuses" for superficial appeasement
  • "The pen is mightier than the sword" (the pen mostly referring to publication of ideas, 'the sword' referring to a show of force).
  • "Lend me your ears", meaning listen to me, give me your attention


Metonymy tends to be associative, usually to refer to a more complex concept with a brief term.

The reference does not necessarily carry any shared properties. The British monarchy is not crown-shaped, food isn't like the plate it's on, appeasement need not take the form of food and distracting entertainment, a driver doesn't resemble their car, publication isn't done with a pen, and Nixon and the bombers shared little more than being in the same chain of command.

A number are likely to be culturally embedded, and somewhat local.

Contrasted with

  • metaphor, which intentionally compares across domains
  • analogy, which works by similarity, often explicit comparison, and is usually used to communicate a shared quality/property. In contrast, metonymy works by contiguity/proximity and is used to suggest associations.


Synecdoche is a subset of metonymy where the substituted name is a part of the thing referred to (or vice versa).

For example,

saying that there are hungry mouths to feed,
or referring to your car as your wheels.

The distinction between metonymy and synecdoche is not always clear.

For example, "The White House said..." could refer to the President, his staff, or both, without making the distinction.

You could argue that both are part of the one concept - or that actions of one are distinct and only associated with actions of the other.


Synecdoche can substitute in several directions:

  • referring to a whole by a part (perhaps the most common variant)
    • Example: 'hands' to refer to workers
  • referring to a part by the whole
    • Example: "The city put up a sign", "The world treated him badly", 'the police', 'the company',
  • referring to a wider class by example
    • Example: 'Bug' (for various insects, spiders, and such), give us our daily bread (food), using brand names like kleenex, xeroxing, googling
  • referring to an example by a wider class
    • Example: milk (usually meaning cow's milk),
  • referring to an object made from a material by that material
    • Example: threads (clothing), silver (for cutlery),
  • referring to contents by its container (also relying on context)
    • Example: keg (of beer),


Some examples are more complex, such as "silver" (a material standing in for a relatively uncommon realization of cutlery), or "the press" (a printing device referring to the news media as a whole)


Synecdoche can be the source or realization of various fallacies, including the fallacy of division, hasty generalization, and more.

Schemes

Sayings

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


When a figure of speech is non-literal, refers to a self-contained message, and we recognize it as such (typically because it has become lexicalized enough to be recognized, and is reproduced fairly faithfully), we tend to call it a saying (or idiom), or something more specific.

These can be comments, references, observations, reactions, aphorisms, and the like.


We have dozens of variants of that, including:

  • Aphorism – a saying that contains a general, observational truth; "a pithy expression of wisdom or truth".
adages or proverbs refer to those that are widely known, and/or in long-term use
  • Cliché or bromide – a saying that is overused and probably unoriginal
which are platitudes when not very applicable, useful, or meaningful


  • Idiom – a phrase that means more than the sum of its parts
often mainly (or only) has non-literal interpretation.
The meaning is more than compositional, perhaps not compositional at all; hearing such a phrase for the first time may convey no meaning.
(There are other meanings of 'idiom', related to expression, but they are rarer and usually clear from context)


  • Epithet – a byname: a saying or word used as a nickname, already widely associated with the person, idea, or thing being referred to.
including those added to a name, e.g. Richard the Lion-Heart
but more often adjectival characterization, e.g. Star-crossed lovers (when referring to Romeo and Juliet)


  • Maxim - An instructional saying about a principle, or rule for behavior.
Which occasionally makes it an aphorism as well
  • Motto – a saying used to concisely state outlook or intentions.
  • Mantra – a repeated saying, e.g. in meditation, religion, mysticism,


  • Epigram – a (written) saying or poem commenting on a particular person, idea, or thing.
Often clever and/or poetic, otherwise they tend to be witticisms.
Often making a specific point. Often short. Can be cliche or platitude.
  • Witticism – a saying that is concise and, preferably, also clever and/or amusing.
Also quips - which are often more in-the-moment.




This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Substituted phrases and/or double-meaninged phrases

🛈 Note that some of this moves well out of phraseology, into 'meanings and words are complex, okay' territory


Substituted phrases

('Substituted phrases' is not a term from linguistics, but seems a useful one to group euphemism and some related concepts)


A euphemism replaces a word/phrase with another, while preserving most of the meaning.

Typically the replacement is a less direct form: more figurative, possibly a metaphor, or otherwise a nearby meaning.

The intent is often to say something without saying it directly, for reasons like:

  • softening emotional blows (e.g. passed away instead of died),
  • tact to avoid potential offense (student is not working to their full potential, developing countries)
  • understatement, e.g. 'more than a few' for 'a lot'
  • avoiding rude sounding words (pretty much any other word used for toilet, including toilet itself originally, is a fairly polite reference for the place where we poop)
  • but probably the most fun, and thereby the most common, is hinting at sexiness, to the point where any unknown phrase, particularly in the form <verb>ing the <adjective>d <noun>, potentially makes us giggle.
compare with innuendo, double entendre
  • not mentioning implications, often doubletalk, e.g. downsizing the department (firing a bunch of people), collateral damage (we murdered some civilians we didn't mean to), special interrogation (torture).
  • powerful sounding business bullshit[9]

(not the best example because these tend to creatively obfuscate meaning in ways that are much less generally known than doubletalk)



A dysphemism or cacophemism replaces a word/phrase with a more offensive one, say, Manson-Nixon line for Mason–Dixon line.


Cacophemism refers to the more blatant and intentionally offensive variation.



Multiple meanings

Polysemy refers to a word referring to multiple distinct possible meanings (or, if you're going to get semiotic, any sign/symbol that does).


Usually multiple related meanings, and this can have useful generalizing value. In fact, in many languages, many words have a small amount of this.



Aside from words in dictionaries, though, we also have phrasing that plays with meanings.


A double entendre is a phrase/sentence that can be interpreted in different ways, mostly cases where at least one interpretation is dirty.

The extra meaning is intentionally there, but the fact that it can be masked by the more direct (and clean) reading gives some deniability, though depending on how you say it, not much.

The words themselves don't necessarily hint at the second meaning. The understanding may come from context and both parties thinking the same thing - a kind of complicity.

If you go looking, you can find a lot of unintentional ones, like anything you can "that's what she said" to.


A single entendre isn't really a thing, though the term is used to point out when people didn't quite manage to make their entendre double, and only managed a single, vaguely vulgar meaning.


Innuendo uses language to allude to additional meaning, yet with a wording that leaves some plausible deniability (without that deniability it would be clear insinuation).


Innuendo can be fun and good-natured (and is then much closer to double entendre, which also only works when both parties understand the suggested meaning), but innuendo is more often used specifically to imply (often clearly, but only imply) something negative: to disparage, to hint at foul play, to plant seeds of doubt about someone, their reputation, or such (see e.g. the early stages of many American presidential runs).

Innuendo, like euphemism, does not have to be sexual, though that kind is perhaps as common as assumed.

A double entendre does not have to be intentional; innuendo (and single entendre) is.


Puns use either multiple meanings of a word, or similar-sounding words, for humour or rhetorical effect. We mostly know them for the really bad ones.

See also:

Phraseology

Phraseology studies and describes the context in which a word is used, a mainly descriptive approach.


Concepts in the area

Collocations

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Collocations are statistically idiosyncratic sequences: series of words that occur together more often than chance would suggest.

Put another way: of all the possible sequences of words, collocation analysis surfaces those that come up more often than expected - say, "pretend to", "as a matter of fact", "leaves all parties", "downright amazed", "good news".


People have slightly varied focus.

  • Collocations are useful in language learning/teaching, which may
    • point at some grammatical idiosyncrasies, like which prepositions tend to sit next to which verbs, and which verbs typically go with specific nouns (see examples below), as this can matter to the correctness of sentences
    • point out that a lot of collocations are not compositional, so when some adjective-noun combination doesn't seem to make direct sense (e.g. bright idea), you can assume it's some sort of expression you should look up
    • overlap strongly with technical terms and jargon - things that carry strong meaning but are not compositional.
    • ...see e.g. http://www.ozdic.com/ for examples
  • Collocations matter to translation.
collocations make translation harder in that word-for-word translation will be wrong (not being compositional)
since collocation analysis points out that a sequence is idiosyncratic, it can make it easier to detect that, which helps focus on learning what it might correspond to
  • natural language generation would like to know these preferred combinations
  • we might also be able to suggest that certain words are more likely to appear than others, helping spelling correction, OCR correction, etc.
  • collocation may focus more on things that appear more often than you would expect, but it is sometimes also useful to note the unusual, i.e. less likely, combinations
  • some uses reveal cultural attitudes, e.g. which adjectives we use for the behaviour of specific groups
  • linguistics may study smaller idiomatic preferences - say, in "VERB a decision", you would probably prefer 'make' over 'take' or most other verbs. Similarly:
adjective-noun, often a preferred adjective used to make a noun stronger or more specific, e.g. maiden voyage, excruciating pain, bright idea, spare time, broad daylight, stiff breeze, strong tea. Alternative adjectives would typically be understood, but be considered somewhat unusual
adverb-adjective, e.g. downright amazed, fully aware
verb-adverb, e.g. prepare feverishly, wave frantically
verb-preposition pairs, e.g. agree with, care about
verb-noun, e.g. we make rather than take a decision, we make rather than tidy a bed
noun combinations: a surge of anger, a bar of soap, a ceasefire agreement
  • it might bring up other patterns useful to natural language parsing - e.g. we agree with someone, we agree on something
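As a minimal sketch of the statistical idea above (assuming simple adjacent-word bigrams and pointwise mutual information as the association measure; real collocation analysis uses larger windows, smoothing, and other measures):

```python
from collections import Counter
from math import log2

def collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI):
    log2( P(w1,w2) / (P(w1) * P(w2)) ).  A high PMI means the pair
    co-occurs more often than the words' individual frequencies predict."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:          # rare pairs give unreliable estimates
            continue
        p_pair = count / (n - 1)       # bigram probability
        p_indep = (unigrams[w1] / n) * (unigrams[w2] / n)
        scores[(w1, w2)] = log2(p_pair / p_indep)
    # highest-scoring (most idiosyncratic) pairs first
    return sorted(scores.items(), key=lambda kv: -kv[1])

# toy corpus: "make a decision" recurs, so its parts score high
tokens = "make a decision make a decision take a break make a decision".split()
for pair, score in collocations(tokens)[:2]:
    print(pair, round(score, 2))
```

On a real corpus the same counting surfaces pairs like "broad daylight" or "excruciating pain"; the min_count cutoff is one of the filtering choices the sections below allude to.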



Collocation analysis

The reference probabilities

Filters and assumptions

Choices in math - and avoiding an explosion of data

MWEs

Phraseme

Unsorted

Figure of speech

Circumlocution

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Circumlocution refers to using an unnecessarily large number of words to express an idea in a roundabout way.


It frequently turns up as indirect and/or long-winded descriptions where more succinct ones exist, or less direct figures of speech where clearer ones exist.


Circumlocution may be done

  • to avoid revealing information,
  • to be intentionally vague (e.g. with equivocation),
  • in language acquisition, teaching meaning via description,
  • to work around not knowing a term in another language, getting there via description,
  • to work around aphasia,
  • to avoid saying specific words (euphemism, cledonism),
  • to construct euphemisms in the innuendo sense,
  • to construct equivocations,
  • and (other) varied rhetoric,
  • creatively, to set up similes


It can also refer to avoiding complex words, and/or inflected/derived terms (see e.g. Periphrasis), and usefully so.

Dictionaries are often intentionally somewhat circumlocutory, to avoid entries depending on other entries you would then have to look up.

Computational aspects

Phrase chunking, phrase identification

Compounds

http://wiki.apertium.org/wiki/Compounds

Named entities

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Named entity recognition usually refers to finding/recognizing phrases that are (used as) nominal compounds.

Systems often deal with entities such as persons, organizations, locations, named objects, and such, particularly when they can work from known lists.

The same systems often also extract simple references such as times and dates, and quantities such as monetary values and percentages.
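A minimal sketch of the list-and-pattern approach described above (the gazetteer entries, labels, and patterns here are invented for illustration; real systems use large curated lists and statistical models):

```python
import re

# Toy gazetteer of known names per label (hypothetical entries).
GAZETTEER = {
    "PERSON": {"Nixon"},
    "LOCATION": {"Hanoi", "Washington"},
    "ORGANIZATION": {"White House"},
}

# Simple patterns for quantity-like references.
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")
PERCENT = re.compile(r"\d+(?:\.\d+)?%")

def tag_entities(text):
    """Return (matched text, label) pairs found by gazetteer lookup
    plus regex patterns for monetary values and percentages."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.group(), label))
    for match in MONEY.finditer(text):
        found.append((match.group(), "MONEY"))
    for match in PERCENT.finditer(text):
        found.append((match.group(), "PERCENT"))
    return found

print(tag_entities("Nixon bombed Hanoi. The White House spent $500; morale fell 20%."))
```

This naive lookup ignores ambiguity and overlapping spans ("White House" vs "House"), which is much of what the recognition/classification tasks listed below actually deal with.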


Specific tasks in the area may be referred to / known as:

  • Entity Extraction
  • Entity Identification (EI)
  • Named Entity Extraction (NEE)
  • Named Entity Recognition (NER)
  • Named Entity Classification (NEC)
  • ...and others.


See also