Some relatively basic text processing



Type-token ratio

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

In analysis of text, tokens are the individual word occurrences, and types are the distinct (unique) words among them.

Type-token ratio is the number of types divided by the number of tokens. This is a crude measure of the lexical complexity of a text.

For example, if a 1000-word text has 250 unique words, it has a type/token ratio of 250/1000 = 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which repeats phrases to avoid the ambiguity of heavily referential sentence structures.
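The calculation itself is short. The following is a minimal sketch, not a definitive implementation: the `tokenize` helper and its regular expression are assumptions made for illustration (lowercasing, and treating runs of letters, digits and apostrophes as words), and a real tokenizer may split text differently.

```python
import re

def tokenize(text):
    # Assumption: a token is a lowercased run of letters, digits or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_ratio(tokens):
    # Types are the distinct tokens; an empty text gets 0.0 by convention here.
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "The cat sat on the mat, and the dog sat on the rug."
tokens = tokenize(sample)
ratio = type_token_ratio(tokens)
print(len(tokens), "tokens,", len(set(tokens)), "types, ratio", round(ratio, 3))
```

The sample sentence has 13 tokens and 8 types, so its ratio is 8/13. Very short texts like this one tend to score high simply because there has been little room for repetition.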

You may wish to apply stemming or lemmatization before calculating a type-token ratio, when you want to avoid counting simple inflected variations as separate types.
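To illustrate the effect of normalizing inflected forms, here is a small sketch. The `LEMMAS` table is a hypothetical, hand-written stand-in for a real stemmer or lemmatizer, used only so that the example runs without external libraries.

```python
# Hypothetical stand-in for a real lemmatizer: a tiny hand-written lookup table.
LEMMAS = {
    "runs": "run", "running": "run", "ran": "run",
    "walks": "walk", "walked": "walk",
}

def lemmatize(tokens):
    # Tokens missing from the table are left unchanged.
    return [LEMMAS.get(tok, tok) for tok in tokens]

def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tokens = "run runs running ran walk walks walked".split()
raw_ttr = type_token_ratio(tokens)               # every form counts as its own type
lemma_ttr = type_token_ratio(lemmatize(tokens))  # forms collapse to "run" and "walk"
print(raw_ttr, lemma_ttr)
```

Without normalization all seven forms are distinct types; after the lookup only two types remain, so the ratio drops from 7/7 to 2/7.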

Type-token ratios of different-sized texts are not directly comparable. Word frequencies in text usually follow a power-law type of distribution (see Zipf's law), so as a text grows, the number of tokens almost necessarily increases faster than the number of types, and longer texts tend to show lower ratios.
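The length effect can be seen with synthetic data. The sketch below is an illustration under stated assumptions, not a measurement of real text: it draws tokens from a made-up 5000-word vocabulary with weights proportional to 1/rank (a Zipf-like distribution), then computes the ratio over increasingly long prefixes.

```python
import random

def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Assumption: synthetic vocabulary with Zipf-like weights (1/rank), fixed seed.
rng = random.Random(0)
vocab = [f"w{rank}" for rank in range(1, 5001)]
weights = [1.0 / rank for rank in range(1, 5001)]
tokens = rng.choices(vocab, weights=weights, k=10000)

# Ratio over the first 100, 1000 and 10000 tokens of the same "text".
ratios = {n: type_token_ratio(tokens[:n]) for n in (100, 1000, 10000)}
for n, r in ratios.items():
    print(n, round(r, 3))
```

The ratio falls as the prefix gets longer, even though the underlying word distribution never changes, which is why ratios from a short and a long document say little when set side by side.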

One semi-brute-force way around this is to calculate the type-token ratio for same-sized chunks of the text, then average those ratios into a single figure. There is no clear-cut best chunk size, and the approach still doesn't work too well if some documents are much shorter than the chunk size.
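The chunk-and-average approach might look like the following sketch. The function name `chunked_ttr`, the default chunk size of 100, dropping the trailing partial chunk, and returning `None` for texts shorter than one chunk are all choices made for this example, not established conventions.

```python
def chunked_ttr(tokens, chunk_size=100):
    # Compute the type-token ratio of each full chunk, then average the ratios.
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        # Assumption: a text shorter than one chunk gets no score at all.
        return None
    return sum(ratios) / len(ratios)

tokens = ["a", "b", "a", "b", "c", "d", "e", "f", "x"]
print(chunked_ttr(tokens, chunk_size=4))
print(chunked_ttr(["a", "b"], chunk_size=4))
```

In the first call the two full chunks score 2/4 and 4/4, and the leftover token is ignored; the second call shows the short-document problem directly, since there is not even one full chunk to score.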

Type-token ratio is often one of several statistics reported in text/corpus summaries, because it is a cheap, rough heuristic for things like lexical complexity and writing style.

Consider plagiarism detection: if text was copied and reworded only slightly, it will have a type-token ratio comparable to that of the original.



Upsides and limitations

Scoring in search