Some relatively basic text processing

Type-token ratio


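In text analysis, a token is an individual occurrence of a word, and a type is a distinct word. The token count is the total number of words in a text; the type count is the number of unique words.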

Type-token ratio is the number of types divided by the number of tokens. It is a crude measure of the lexical complexity of a text.

For example, if a 1000-word text has 250 unique words, it has a type-token ratio of 0.25. That is relatively low and suggests a simpler writing style, or possibly legalese, which repeats phrases to reduce ambiguity rather than relying on heavily referential sentence structures.
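As a rough illustration of the calculation, here is a minimal Python sketch. The function name type_token_ratio and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for this example; real tokenizers make different choices, and those choices change the ratio.

```python
import re

def type_token_ratio(text):
    """Return (type_count, token_count, ratio) for a crude tokenization."""
    # Assumption: a token is a run of ASCII letters/apostrophes, case-folded.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0, 0, 0.0
    types = set(tokens)  # distinct word forms
    return len(types), len(tokens), len(types) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the rug"
n_types, n_tokens, ratio = type_token_ratio(sample)
print(n_types, n_tokens, round(ratio, 3))  # 8 types over 13 tokens
```

The sample sentence is far too short to say anything about style; it only shows the arithmetic.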

You may wish to apply stemming or lemmatization before calculating a type-token ratio, when you want to avoid counting simple inflected variations of a word as separate types.
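To show the effect of normalizing inflections, the sketch below compares the ratio before and after mapping word forms to a base form. The TOY_LEMMAS table is a hand-written, hypothetical stand-in for a real stemmer or lemmatizer, used only to keep the example self-contained.

```python
# Hypothetical lookup table standing in for a real stemmer/lemmatizer.
TOY_LEMMAS = {"runs": "run", "running": "run", "ran": "run", "dogs": "dog"}

def ttr(tokens):
    """Type-token ratio of an already-tokenized text."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tokens = "the dog runs and the dogs ran while running".split()
lemmas = [TOY_LEMMAS.get(t, t) for t in tokens]  # unknown words pass through

raw_ttr = ttr(tokens)    # inflected forms count as separate types
lemma_ttr = ttr(lemmas)  # inflected forms collapse into one type
print(round(raw_ttr, 3), round(lemma_ttr, 3))
```

The token count stays the same; only the number of types drops, so the ratio after normalization can never be higher than before.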

Type-token ratios of texts of different lengths are not directly comparable. Word frequencies in text usually follow a power-law type of distribution, so as a text grows, the token count keeps rising steadily while new types appear ever more slowly. Longer texts will therefore almost necessarily show a lower ratio.

One fairly brute-force way around this is to calculate the type-token ratio for same-sized chunks of the text, then average those ratios into a single figure. The best chunk size is not a clear-cut decision, and the approach still doesn't work well if some documents are shorter than the chunk size.
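The chunk-and-average idea can be sketched as follows. The function name mean_chunk_ttr, the default chunk size of 100, and the decision to drop a trailing partial chunk are all assumptions made for this example, not settled conventions.

```python
def mean_chunk_ttr(tokens, chunk_size=100):
    """Average the type-token ratio over consecutive full-size chunks.

    Assumption: a trailing partial chunk is discarded so every ratio is
    computed over exactly chunk_size tokens. Returns None when the text
    is shorter than one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        return None  # text too short for even one chunk
    return sum(ratios) / len(ratios)

tokens = "a b a c a b d d e a f g".split()
result = mean_chunk_ttr(tokens, chunk_size=4)  # chunk ratios: 0.75, 0.75, 1.0
print(result)
```

Returning None for too-short texts makes the weakness mentioned above explicit: such documents simply get no comparable figure.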

Type-token ratio is often one of several statistical figures reported in text/corpus summaries, because it is a cheap, rough heuristic for lexical variety.

Consider plagiarism detection: if text was copied and reworded only slightly, it will have a comparable type-token ratio.



