Some relatively basic text processing

'''For context'''

Say you make a table that counts all words in all documents (see the short code sketch below)
: the rows correspond to individual documents
: the columns are each of the words (that appear amongst ''all'' documents)
: so the cells count how often a word appears in a specific document


Something like that is potentially useful to compare documents, right?
And perhaps compare terms?

Yes, but only so far - it turns out ''just'' counting is skewed in multiple ways. Consider that:
* longer documents would have higher numbers, not because their terms are more important, but because longer documents simply contain more of each word
:: so scoring by just counts would make comparing documents of different lengths misleading
:: okay, normalizing by document length would help

* common words would have high counts just because, well, they are common. [[Zipf]]'s law points out what this distribution usually looks like -- and that the first dozen words (often semantically empty function words like the, to, be, of, and, in, a, that) will cover 10 to 20% of all words.
:: so scoring by just counts, normalized or not, would still disproportionately emphasize these mostly-empty terms.
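A minimal sketch of such a count table in plain Python, using a collections.Counter per document (the toy documents and variable names are made up for illustration):

<syntaxhighlight lang="python">
from collections import Counter

docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat and the dog barked",
}

# Rows: documents; columns: words; cells: how often that word occurs in that document.
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
print(counts["doc2"]["the"])    # 3 -- partly just because doc2 is the longer document

# Normalizing by document length gives relative term frequencies,
# which deals with the length skew...
tf = {}
for name, counter in counts.items():
    total = sum(counter.values())
    tf[name] = {word: n / total for word, n in counter.items()}

# ...but common function words like "the" still dominate either way,
# which is the gap that the TF-IDF weighting discussed below is meant to address.
print(sorted(tf["doc2"].items(), key=lambda kv: -kv[1])[:3])
</syntaxhighlight>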



Type-token ratio

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

In analysis of text, a token is an individual word occurrence, and a type is a distinct word; the type count is the number of unique words.

Type-token ratio is the number of types divided by the number of tokens. It is a crude measure of the lexical complexity of a text.

For example, if a 1000-word text has 250 unique words, it has a type-token ratio of 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which repeats phrases to stay less ambiguous than heavily referential sentence structures would be.
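
A minimal sketch of the calculation in Python (lowercasing and whitespace tokenization are simplifying assumptions; real tokenization also has to deal with punctuation):

<syntaxhighlight lang="python">
def type_token_ratio(text: str) -> float:
    """Unique words (types) divided by total words (tokens)."""
    # Simplistic tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# A 1000-token text with 250 distinct words would score 250 / 1000 = 0.25
print(type_token_ratio("the cat sat on the mat"))   # 5 types / 6 tokens ~= 0.83
</syntaxhighlight>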


You may wish to apply stemming or lemmatization before calculating a type-token ratio, if you want to avoid counting simple inflected variations (walk, walks, walked) as separate types.
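
For example, using NLTK's Porter stemmer (assuming the nltk package is installed; a lemmatizer would be a more careful choice but needs extra resources):

<syntaxhighlight lang="python">
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tokens = "walk walks walked walking".split()
stems = [stemmer.stem(t) for t in tokens]    # ['walk', 'walk', 'walk', 'walk']

# Unstemmed: 4 types / 4 tokens = 1.0;  stemmed: 1 type / 4 tokens = 0.25
print(len(set(tokens)) / len(tokens))
print(len(set(stems)) / len(stems))
</syntaxhighlight>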


Type-token ratios of different-sized texts are not directly comparable: word frequencies usually follow a power-law type of distribution, which implies that as a text grows, tokens accumulate faster than new types, so longer texts almost necessarily have lower ratios.

One semi-brute way around this is to calculate type-token ratios for same-sized chunks of the text, then average those ratios into a single figure (the best chunk size is not a clear-cut decision, and it still doesn't work well when some documents are much shorter than the chunk size).
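
A sketch of that chunked averaging (the chunk size of 100 tokens is an arbitrary default here, not a recommendation):

<syntaxhighlight lang="python">
def mean_chunked_ttr(text: str, chunk_size: int = 100) -> float:
    """Average the type-token ratio over consecutive same-sized chunks.

    Only full chunks are used, so a text shorter than chunk_size gives an error.
    """
    tokens = text.lower().split()
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than a single chunk")
    return sum(ratios) / len(ratios)
</syntaxhighlight>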


Type-token ratio is often one among various statistical figures in text/corpus summaries, because it's a cheap, rough heuristic for lexical variety and style.

Consider plagiarism detection: if a text was copied and only slightly reworded, it will have a type-token ratio comparable to the original.


TF-IDF

Basics

Upsides and limitations

Scoring in search

Augmentations