Some relatively basic text processing
'''For context'''

Say you make a table that counts all words in all documents
: the rows correspond to individual documents
: the columns are each of the words (that appear amongst ''all'' documents)
: so the cells count how often a word appears in a specific document
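To make that concrete, here is a minimal sketch (an illustration, not part of the original page) of building such a count table in Python -- the tiny example documents are made up:

<syntaxhighlight lang="python">
# A toy term-document count table: rows are documents, columns are words,
# cells are how often that word appears in that document.
from collections import Counter

docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "dogs and cats are pets",
}

# crude whitespace tokenization; real code would deal with punctuation, case, etc.
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}

# the columns: every word that appears in *any* of the documents
vocabulary = sorted({word for c in counts.values() for word in c})

# print the table, one row per document
print("\t".join([""] + vocabulary))
for name, c in counts.items():
    print("\t".join([name] + [str(c[word]) for word in vocabulary]))
</syntaxhighlight>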
Something like that is potentially useful to compare documents, right?
And perhaps compare terms?
Yes, but only so far - it turns out ''just'' counting is skewed in multiple ways, including (but not limited to):
* longer documents would have higher numbers not because terms are more important, but because longer documents probably have more of that word
:: so scoring by just counts would make comparing documents impossible
:: okay, normalization would help
* common words would have high counts just because, well, they are common. [[Zipf]]'s law points out what this distribution usually looks like -- and that the first dozen words (often semantically empty function words like the, to, be, of, and, in, a, that) will cover 10 to 20% of all words.
:: so scoring by just counts, normalized or not, would still disproportionately emphasize these mostly-empty terms.
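To make both skews a little more tangible, a quick sketch (again an illustration, not from the page itself): length-normalizing counts makes a short and a long document comparable, and a quick check shows how much of a text its few most common tokens cover:

<syntaxhighlight lang="python">
from collections import Counter

doc_short = "the cat sat on the mat"
doc_long = "the cat sat on the mat " * 50 + "and then a dog appeared"

# skew 1: raw counts grow with document length, relative frequencies do not
for name, text in [("short", doc_short), ("long", doc_long)]:
    tokens = text.lower().split()
    counts = Counter(tokens)
    print(name, "raw count of 'cat':", counts["cat"],
          "  relative frequency:", round(counts["cat"] / len(tokens), 3))

# skew 2 (the Zipf point): a handful of very common tokens covers a lot of text
# (this toy text is artificially repetitive, but the effect holds in real corpora)
tokens = doc_long.lower().split()
counts = Counter(tokens)
top = counts.most_common(3)
coverage = sum(n for _, n in top) / len(tokens)
print("top 3 tokens", [w for w, _ in top], "cover",
      round(coverage * 100), "% of all tokens")
</syntaxhighlight>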
===Type-token ratio===
In analysis of text, ''token'' refers to individual words, and ''type'' to the number of unique words.
Type-token ratio is the number of types divided by the number of tokens. This is a crude measure of the lexical complexity of a text.
For example, if a 1000-word text has 250 unique words, it has a type/token ratio of 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which tends to repeat phrases to reduce ambiguity rather than rely on heavily referential sentence structure.
You may wish to apply stemming or lemmatization before calculating a type-token ratio, if you want to avoid counting simple inflected variations as separate types.
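A minimal way to compute it (an illustrative sketch; the crude tokenizer is an assumption, and you could run the tokens through a stemmer or lemmatizer first, as just mentioned):

<syntaxhighlight lang="python">
import re

def type_token_ratio(text):
    # crude tokenization: lowercase, keep runs of letters/apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    # optionally stem or lemmatize the tokens here before counting types
    types = set(tokens)
    return len(types) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the cat"
print(type_token_ratio(sample))  # 7 types / 13 tokens, so roughly 0.54
</syntaxhighlight>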
Type-token ratios of different-sized texts are not directly comparable,
as word frequencies in text usually follow a power-law type of distribution,
implying that longer texts will almost necessarily show a larger increase in tokens than in types.
One semi-brute way around this is to calculate type-token figures for same-sized chunks of the text, then average the ratios into a single figure (the best chunk size is not a clear-cut decision, though it still doesn't work too well if some documents are much shorter than the chunk size).
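A sketch of that chunk-and-average approach (illustrative only; the chunk size of 1000 tokens is an arbitrary choice here):

<syntaxhighlight lang="python">
def averaged_ttr(tokens, chunk_size=1000):
    # Mean type-token ratio over consecutive same-sized chunks.
    # Tokens past the last full chunk are ignored, and a document shorter
    # than chunk_size yields no ratio at all -- the weakness noted above.
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return sum(ratios) / len(ratios) if ratios else None

sample_tokens = "the quick brown fox jumps over the lazy dog".split() * 300
print(averaged_ttr(sample_tokens, chunk_size=1000))  # very low: tiny vocabulary, long text
</syntaxhighlight>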
Type-token ratio is often one among various statistical figures/factors in text/corpus summaries,
because it's a rough heuristic for things like lexical variety and writing style.
Consider plagiarism detection: if a text was copied and only slightly reworded, it will still have a comparable type-token ratio.