Some relatively basic text processing

Type-token ratio

This article/section is a stub: probably a pile of half-sorted, first-version notes, not well checked, so it may have incorrect bits. (Feel free to ignore, or tell me)

In analysis of text, token refers to each individual word occurrence, and type to a distinct word form, so the type count is the number of unique words.

The type-token ratio is the number of types divided by the number of tokens. This is a crude measure of the lexical diversity of a text.

For example, if a 1000-word text has 250 unique words, it has a type-token ratio of 250/1000 = 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which repeats phrases verbatim to reduce ambiguity rather than relying on heavily referential sentence structures.
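The calculation above can be sketched in a few lines. Tokenization here is a crude lowercase-and-whitespace split, which is an assumption -- real tokenizers handle punctuation, clitics, and so on more carefully.

```python
def type_token_ratio(text: str) -> float:
    """Number of distinct words (types) divided by total words (tokens)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the cat sat on the mat"))  # 5 types / 6 tokens
```

Lowercasing before counting is itself a normalization choice: without it, "The" and "the" would count as two types.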


When you want to avoid counting inflected variants of the same word as separate types, apply stemming or lemmatization before calculating the type-token ratio.
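To illustrate the effect of that normalization, here is a deliberately toy stand-in for a real stemmer (a proper lemmatizer, e.g. from spaCy or NLTK, handles far more than plural -s; this is only to show how normalization shrinks the type count):

```python
def strip_plural_s(word: str) -> str:
    # Crude, hypothetical normalization: drop a trailing "s".
    # A real stemmer/lemmatizer is much more careful than this.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalized_types(tokens):
    return {strip_plural_s(t.lower()) for t in tokens}

tokens = ["Cats", "cat", "dogs", "dog"]
print(len({t.lower() for t in tokens}))  # 4 surface types
print(len(normalized_types(tokens)))     # 2 normalized types
```

Whether this is desirable depends on what you are measuring: for lexical diversity, collapsing inflections usually gives a fairer figure.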


Type-token ratios of different-sized texts are not directly comparable: word frequencies in text usually follow a power-law type of distribution, so as a text grows, its token count increases faster than its type count, and the ratio drifts downward almost regardless of style.

One semi-brute way around this is to calculate type-token figures for same-sized chunks of the text, then average the per-chunk ratios into a single figure. The best chunk size is not a clear-cut decision, though, and the method still works poorly when some documents are much shorter than the chunk size.
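A minimal sketch of that chunked averaging, assuming a fixed chunk size and dropping the trailing partial chunk (one of several reasonable choices for handling the remainder):

```python
def mean_segmental_ttr(tokens, chunk_size=1000):
    """Average the type-token ratio over consecutive same-sized chunks."""
    ratios = []
    for i in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[i:i + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    # Texts shorter than one chunk yield no ratios at all -- the
    # short-document problem mentioned above.
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Because every chunk has the same length, each per-chunk ratio sits on the same part of the vocabulary-growth curve, which is what makes the averaged figure comparable across texts of different total length.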


Type-token ratio often appears among the statistical figures in text/corpus summaries, because it is a cheap, rough heuristic for several properties of a text.

Consider plagiarism detection: text that was copied and only slightly reworded will retain a type-token ratio comparable to the original.


TF-IDF

Basics

Upsides and limitations

Scoring in search

Augmentations