Some relatively basic text processing
This article/section is a stub — probably a pile of half-sorted notes and a first version, not well-checked, so it may have incorrect bits. (Feel free to ignore, or tell me)
In text analysis, a token refers to an individual word occurrence, and a type to a distinct word form, so the type count is the number of unique words.
The type-token ratio (TTR) is the number of types divided by the number of tokens. This is a crude measure of the lexical diversity/complexity of a text.
For example, if a 1000-word text has 250 unique words, its type-token ratio is 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which repeats phrases to reduce ambiguity rather than relying on heavily referential sentence structure.
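A minimal sketch of the calculation (the regex tokenizer and lowercasing here are simplifying assumptions; a real pipeline would use a proper tokenizer):

```python
import re

def type_token_ratio(text):
    """Crude type/token ratio: distinct lowercased word forms over total word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())   # naive word tokenizer (assumption)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the cat sat on the mat and the cat slept"))  # 7 types / 10 tokens = 0.7
```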
Type-token ratios of different-sized texts are not directly comparable: word frequencies in text roughly follow a power-law distribution, so longer texts almost necessarily add tokens faster than they add new types, and the ratio drifts downward with length.
One semi-brute-force way around this is to calculate type-token ratios for same-sized chunks of the text and average them into a single figure (the best chunk size is not a clear-cut decision, and it still doesn't work well when some documents are much shorter than the chunk size).
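A rough sketch of the chunk-and-average idea; the chunk size of 500 and the policy of dropping the short tail are arbitrary choices for illustration, not a standard:

```python
def mean_chunked_ttr(tokens, chunk_size=500):
    """Average the type-token ratio over consecutive same-sized chunks.

    Chunks shorter than chunk_size (the leftover tail, or a whole short
    document) are skipped, which is why documents much shorter than the
    chunk size get no estimate at all (None).
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return sum(ratios) / len(ratios) if ratios else None
```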
Type-token ratio often appears as one of various statistical figures in text/corpus summaries, because it is a rough heuristic for a few things.
Consider plagiarism detection: text that was copied and only slightly reworded will have a type-token ratio comparable to the original.