Some relatively basic text processing
Type-token ratio
In text analysis, token refers to an individual word occurrence, and type to a distinct word form, so the number of types is the number of unique words.
The type-token ratio is the number of types divided by the number of tokens. It is a crude measure of the lexical complexity of a text.
For example, if a 1000-word text has 250 unique words, its type-token ratio is 250/1000 = 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which tends to repeat phrases verbatim to avoid the ambiguity of heavily referential sentence structures.
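The basic calculation can be sketched in a few lines; the whitespace-split tokenizer here is a simplification, and a real analysis would use a proper tokenizer:

```python
def type_token_ratio(text):
    """Crude type-token ratio: unique words / total words.

    Tokenization here is a plain lowercased whitespace split;
    punctuation handling is deliberately ignored for brevity.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# "the" repeats, so 5 types over 6 tokens:
print(type_token_ratio("the cat sat on the mat"))
```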
You may wish to apply stemming or lemmatization before calculating a type-token ratio, to avoid counting simple inflected variants of the same word as separate types.
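To illustrate the effect, the toy suffix-stripper below stands in for a real stemmer (e.g. a Porter stemmer from a library such as NLTK); it is far too naive for production use, but shows how collapsing inflected variants changes the ratio:

```python
def crude_stem(word):
    # Toy stand-in for a real stemmer: strip a few common English
    # suffixes so inflected variants collapse to one form.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stemmed_ttr(text):
    """Type-token ratio over crudely stemmed, lowercased tokens."""
    tokens = [crude_stem(w) for w in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Raw TTR of this text would be 1.0 (all tokens distinct);
# after stemming, all four collapse to "walk":
print(stemmed_ttr("walk walks walked walking"))
```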
Type-token ratios of different-sized texts are not directly comparable,
as word frequencies in text usually follow a power-law type of distribution,
implying that longer texts will almost necessarily show a larger increase in tokens than in types.
One semi-brute-force way around this is to calculate type-token figures for same-sized chunks of the text, then average those ratios into a single figure. The best chunk size is not a clear-cut choice, and the approach still works poorly when some documents are much shorter than the chunk size.
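That chunk-and-average scheme can be sketched as below; dropping the trailing partial chunk is one possible choice (an assumption here, not the only option), and it makes concrete the problem with documents shorter than the chunk size:

```python
def chunked_ttr(tokens, chunk_size=100):
    """Mean type-token ratio over fixed-size chunks of a token list.

    Only full chunks are used; a trailing partial chunk is dropped,
    so a text shorter than chunk_size yields no figure at all (None).
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start : start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        return None
    return sum(ratios) / len(ratios)

# Two chunks of size 2, each with one repeated word:
print(chunked_ttr(["a", "a", "b", "b"], chunk_size=2))
```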
Type-token ratio often appears as one of various statistical figures in text/corpus summaries,
because it is a quick, rough heuristic for lexical diversity and writing style.
Consider plagiarism detection: if a text was copied and only slightly reworded, it will have a type-token ratio comparable to the original's.