Varied text processing
Type-token ratio
In analysis of text, token refers to each individual word occurrence, and type to each distinct word.
The type-token ratio is the number of types divided by the number of tokens. This is a crude measure of the lexical complexity of a text.
For example, if a 1000-word text has 250 unique words, it has a type-token ratio of 0.25, which is relatively low and suggests a simpler writing style - or possibly legalese, which repeats phrases to avoid the ambiguity of heavily referential sentence structures.
You may wish to apply stemming or lemmatization before calculating a type-token ratio, if you want to avoid counting simple inflected variations as separate types.
Type-token ratios of different-sized texts are not directly comparable,
as word frequencies in text usually follow a power-law type of distribution,
implying that longer texts will almost necessarily show a larger increase of tokens than of types.
One semi-brute way around this is to calculate type-token figures for same-sized chunks of the text, then average the ratios into a single figure (the best chunk size is not a clear-cut decision, and it still doesn't work too well if some documents are much shorter than the chunk size).
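A minimal sketch of both the plain and the chunk-averaged ratio, assuming a throwaway lowercase/whitespace tokenization (real use would want a proper tokenizer, and perhaps lemmatization first):

 def type_token_ratio(tokens):
     return len(set(tokens)) / len(tokens)

 def chunked_ttr(tokens, chunk_size=1000):
     # average the ratio over consecutive same-sized chunks,
     # ignoring a final chunk that would be smaller than chunk_size
     chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
     chunks = [c for c in chunks if len(c) == chunk_size]
     if not chunks:  # text shorter than one chunk: fall back to the plain ratio
         return type_token_ratio(tokens)
     return sum(type_token_ratio(c) for c in chunks) / len(chunks)

 tokens = ("the cat sat on the mat and the dog sat on the rug " * 100).lower().split()
 print(type_token_ratio(tokens))            # low, because this 'text' is very repetitive
 print(chunked_ttr(tokens, chunk_size=500))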
Type-token ratio is often one among various statistical figures in text/corpus summaries,
because it's a quick, rough heuristic for things like repetitiveness and writing style.
Consider plagiarism detection: if text was copied and reworded only slightly, it will have a comparable type-token ratio.
Supporting ideas
TF-IDF
Basics
tf-idf upsides and limitations
Scoring in search
Augmentations
Bag of words / bag of features
The bag-of-words model (more broadly, the bag-of-features model) uses the collection of words in a context as an unordered multiset, a.k.a. bag.
In other words, we summarize a document (or part of it) by the presence or count of its words, and ignore things like adjacency and order - and thereby any grammar.
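As a minimal sketch (again with a throwaway lowercase/whitespace tokenization), a bag of words is little more than a word count:

 from collections import Counter

 doc = "the cat sat on the mat because the mat was warm"
 bag = Counter(doc.lower().split())
 print(bag)   # e.g. Counter({'the': 3, 'mat': 2, 'cat': 1, ...})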
In text processing
In introductions to Naive Bayes as used for spam filtering, its naivety is essentially the assumption that features are independent - which, for word features, includes that their order does not matter.
Though real-world naive bayes spam filtering would use more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words.
Other types of classifiers also make this assumption, or make it easy to do so.
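A minimal sketch of such a 1-gram bag-of-words classifier, assuming scikit-learn; the toy messages and labels are made up for illustration:

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.naive_bayes import MultinomialNB

 messages = ["win money now", "cheap pills win big",
             "meeting at noon", "lunch tomorrow?"]
 labels   = ["spam", "spam", "ham", "ham"]

 vectorizer = CountVectorizer()               # 1-gram bag of words
 X = vectorizer.fit_transform(messages)       # sparse word counts; order is ignored
 classifier = MultinomialNB().fit(X, labels)

 print(classifier.predict(vectorizer.transform(["win a cheap lunch"])))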
Bag of features
While the idea is best known from text, hence bag-of-words, you can argue for a bag of features, applying it to any kind of feature you can consider independently and count.
For example, you may follow up object detection in an image with logic like "if this photo contains a person, a dog, a tree, and grass", because each detection may be easy enough individually, and the combination tends to narrow down what kind of photo it is.
In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. detecting a bunch of edges of road signs might be easier and more robust than detecting the whole sign), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.
The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.
See also:
"Linear bag-of-words"
"n-gram bag-of-words"
N-gram notes
N-grams are contiguous sequences of length n.
They are most often seen in computational linguistics.
Applied to sequences of characters they can be useful e.g. in language identification,
but the more common application is to words.
As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words - for example statistical parsing, collocation analysis, text classification, sliding-window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.
For example, for the already-tokenized input "This is a sentence ." the 2-grams would be:
- This is
- is a
- a sentence
- sentence .
...though depending on how specially you do or do not want to treat the edges, people might pad with empty or special tokens at the edges.
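A minimal sketch of generating word n-grams, with optional padding tokens at the edges (the '<s>' marker is just an arbitrary choice here):

 def ngrams(tokens, n, pad=None):
     if pad is not None:
         tokens = [pad] * (n - 1) + tokens + [pad] * (n - 1)
     return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

 tokens = ['This', 'is', 'a', 'sentence', '.']
 print(ngrams(tokens, 2))
 # [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
 print(ngrams(tokens, 2, pad='<s>'))
 # additionally has ('<s>', 'This') and ('.', '<s>') at the edges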
Skip-grams
Skip-grams now seem to refer to two different things.
What they share is the consideration of words from a text, where the parts they pick up occur at a distance of exactly, or at most, k from each other.
Skip-gram tuples
The classical skip-grams are an extension of the n-gram idea, but allowing the process to skip words - they do not have to be consecutive.
This has varied implementations for varied goals.
Sometimes just to skip stopwords (n-grams blind to specific tokens).
Sometimes at a fixed distance, skipping words (wider context, but selecting fewer).
Sometimes even more specific, such as picking up only nouns, or avoiding determiners at the edges.
On the word2vec side of things, skip-gram refers to the variant that uses the word in the middle to predict the words around it (more on that below).
In most of those definitions, skip-grams preserve order.
Others consider them unordered, because what they intend to do is less about structure and more about local co-occurrence - more like a bag of words altered for selectivity and/or window size.
Either way, the reason to use them is often to do n-gram things with less data sparsity, and without going as crazy on the amount of input, while still focusing on relations expressed by close-ish context.
Skip-grams will lead to more output tuples - though usually only a factor more.
It also allows us to pick up patterns of words a little further away from each other without having to include (an explosion of) larger n-grams - because skipping makes you look further for the same-sized output tuple.
Skip-grams apparently came from speech analysis, processing phonemes.(verify)
In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.
In any case, using skip-grams adds a few assumptions that you should be aware you are choosing.
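For example, a minimal sketch of k-skip-bigrams in the 'at most k words skipped, order preserved' sense:

 def skip_bigrams(tokens, k):
     # ordered pairs with at most k tokens skipped between them; k=0 gives plain bigrams
     pairs = []
     for i, left in enumerate(tokens):
         for j in range(i + 1, min(i + 2 + k, len(tokens))):
             pairs.append((left, tokens[j]))
     return pairs

 tokens = ['This', 'is', 'a', 'sentence', '.']
 print(skip_bigrams(tokens, k=1))
 # [('This', 'is'), ('This', 'a'), ('is', 'a'), ('is', 'sentence'),
 #  ('a', 'sentence'), ('a', '.'), ('sentence', '.')]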
Skip-gram neural word2vec
These days, a web search for skip-grams will often lead to something else. You will know it when you see it mentioned next to word2vec and cbow (or tok2vec, if you're reading spacy documentation), where word2vec is the idea of assigning vectors based on co-occurrence within a large amount of text.
"Skip-Gram Word2Vec" and "CBow Word2Vec" then are simply two specific ways (among others)
that answer how to use that large amount of text - see this paper.
Skip-gram here seems to actually point to a neural setup of input layer, embedding layer, and output layer.
More specifically, the skip-gram variant trains that network to predict the surrounding context words from the middle word, while CBOW does the reverse.
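As a minimal sketch, assuming gensim 4.x, the two variants are a parameter apart; the two-sentence 'corpus' here is only a placeholder, and real training needs far more text before the vectors mean anything:

 from gensim.models import Word2Vec

 corpus = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
           ['the', 'dog', 'lay', 'on', 'the', 'rug']]

 skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram
 cbow     = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)  # CBOW

 print(skipgram.wv.most_similar('cat', topn=3))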
See also:
- T Mikolov et al. (2013) "Efficient Estimation of Word Representations in Vector Space"
- D Guthrie et al. (2006) "A closer look at skip-gram modelling"
Syntactic n-grams
Flexgrams
Counting towards meaning
LSA, LSI
Latent Semantic Analysis is the application of Singular Value Decomposition to text analysis and search, primarily to words-per-document matrix sorts of data (with variants, as there always are).
It is frequently combined with the cosine metric for distances.
It's one of various things motivated by the distributional-similarity observation (the idea that "a word is characterized by the company it keeps"). As a result, this analysis has some semantic value, in that it implicitly does a sort of collocation-style analysis and finds co-associations,
because part of what you do by feeding in real-world text is telling it about similarity and contrast in that text.
Latent Semantic Indexing (LSI) isn't so much a variant of this as it is a use: using LSA's output to index the text documents the numbers/vectors came from. If you map queries to a vector in the same way, you can then do searches that are fuzzier in word choice.
Other uses include approximating the sense of unseen words, and assisting topic modelling.
It should be said that we now often either have better methods of doing these things, or the methods we're using anyway can do them as well.
Yet it's simpler to implement than most of them, so it's at least worth experimenting with to compare.
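A minimal LSA/LSI sketch, assuming scikit-learn; the documents are toy placeholders, and TruncatedSVD stands in for the SVD over the (here tf-idf weighted) term-document matrix:

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.decomposition import TruncatedSVD
 from sklearn.metrics.pairwise import cosine_similarity

 docs = ["the cat sat on the mat",
         "a dog lay on the rug",
         "stock markets fell sharply today",
         "shares dropped as markets slid"]

 tfidf = TfidfVectorizer()
 X = tfidf.fit_transform(docs)                 # documents x terms
 svd = TruncatedSVD(n_components=2)            # the 'latent' dimensions
 doc_vectors = svd.fit_transform(X)

 query = svd.transform(tfidf.transform(["markets dropped"]))
 print(cosine_similarity(query, doc_vectors))  # the finance-ish documents should score highest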
Topic modelling
Very roughly, topic classification and topic modelling are about grouping documents via representative topics.
Topic classification tends to refer to more supervised methods - working off a labeled dataset, putting documents into known groups.
Topic modelling tends to refer to barely-supervised methods, where it tries to figure that out for you.
- and if that works at all, it will also tend to discover hidden themes ('latent topics' if you're fancy) - often meaning it gives you sets of terms that it thinks may share a theme, while you didn't explicitly ask for those themes
- Maybe think of modelling as asking a random person on the street to summarize a pile of things by picking out some interesting terms.
Depending on your choice of algorithm, there isn't always the world of difference between the two - they may still share aspects such as
- working better if you ask it for a data-appropriate number of groups
- working better if the 'documents' are more clearly defined
- a document can be assigned to more than one topic, but whether that helps you at all will depend
Topic modelling is also not really a singular goal or algorithm.
Some focus more on finding the most useful topics, attempting to classify a group of documents, or both.
- Heck, some are basically rule-based expert systems, or little more than some tf-idf trickery
- ...yet recently the term points primarily at techniques that need less supervision to do some decent group finding
Both differ from classification simply in that you didn't know all the distinguishing classes ahead of time (if you did, this might reduce to a much simpler classification task).
Topic modelling is arguably most useful as a way to explore datasets, in that it should be able to find some major distinguishing features that happen to be useful to cluster that particular set of documents.
Chances are it will also be able to tell you some interesting phrases it found in the process.
A 'topic' is often a set of phrases that seemed to belong together
- in theory, you might be able to put a name to each, allowing you to quickly figure out the broad lines of what it is describing.
- ...in most practice, a lot of topics tend to be fairly messy
See also
LDA
Latent Dirichlet Allocation
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
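A minimal sketch, assuming a recent scikit-learn; the documents are toy placeholders, and in real use you would tune n_components (the number of topics) and feed it far more text:

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.decomposition import LatentDirichletAllocation

 docs = ["the cat chased the mouse", "dogs and cats as pets",
         "stock markets fell sharply", "shares and markets slid today"]

 counts = CountVectorizer(stop_words='english')
 X = counts.fit_transform(docs)
 lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

 terms = counts.get_feature_names_out()
 for i, topic in enumerate(lda.components_):
     top_terms = [terms[j] for j in topic.argsort()[-4:]]   # highest-weighted terms per topic
     print("topic", i, top_terms)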
Putting semantic numbers to words and to more
Computers and people and numbers
vector space representations, word embeddings, and more
Just count in a big table?
Word embeddings
The "vectors maybe" and "how about tables" arguments above would, ideally, lead us towards no longer representing only the words but also, somehow, their basic sense, or at least some comparability to similar words.
You can manage to do that in many distinct ways.
Emphasis on both the many, and the distinct - you probably don't care to know the details of most of them.
There are some older distributional-similarity approaches. Many of these were first efforts - simple to start with, giving decent and decently explainable results - but some got distracted by less useful things, many were clunky to work with and not hugely efficient, and many had issues that made it hard to throw a lot of data at them and still get reasonable results.
A bunch of that came from these methods working via an intermediate of high-dimensional, sparse word vectors, e.g. where each dimension represented a specific context.
"Word embeddings" often refers to word vectors that are somehow more compact, while still staying semantically useful.
- Dense compared to the input, and compared to simpler approaches
- method aside, an output vector of maybe two hundred dimensions already seems to give reasonable (dis)similarity for a good amount of basic jobs (which is a lot more compact than the ~100K recognized words you probably started from)
- Semantically useful in that however it packs what it cares about into numbers, it has properties that tend to be useful
- often focusing on such (dis)similarity comparisons - e.g. 'bat' may get a sense of animal and tool, of verb and noun - or rather, appear in this space close to animals, and tools, and verbs - or at least closer to each of those than e.g. 'banana' does.
- Even if it doesn't perfectly encode each use, it will not mix in bananas, and there are plenty of search and comparison tasks for which this is at the very least good assistance
"This amounts to even more up-front work, so why do this dense semantic thing?"
One reason is practical - classical methods run into some age-old machine learning problems: the more dimensions you add, the more you run into issues like the curse of dimensionality and sparsity. These happen to be issues that various word embedding methods sidestep somewhat. By cheating, yes, but by cheating quite well for specific uses.
Also, putting all words in a single space lets us compare terms, sentences, and documents.
If the vectors are good (for the scope you have),
this can be a good approximation of semantic similarity.
If we can agree on a basis between uses (e.g. build a reasonable set of vectors per natural language) we might even be able to give a basic idea of e.g. what any unseen new document is about.
(The last turns out to be optimistic, for a slew of reasons. You can often get better results out of becoming a little domain-specific.)
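The comparison itself is often just cosine similarity between vectors, whatever produced them - a minimal sketch with made-up three-dimensional vectors:

 import numpy as np

 def cosine_similarity(a, b):
     a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

 cat    = [0.8, 0.1, 0.3]   # made-up embedding vectors
 tiger  = [0.7, 0.2, 0.4]
 banana = [0.0, 0.9, 0.1]

 print(cosine_similarity(cat, tiger))    # relatively high
 print(cosine_similarity(cat, banana))   # relatively low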
"So how do you get these dense semantic vectors?"
There are varied ways.
You could start with well-annotated data (that e.g. helps define similarity/dissimilarity),
and that might be of higher quality in the end.
But it is hard to come by annotation for many aspects over an entire language, and it's a lot of work to even try - and that's still ignoring details like contextual ambiguity, analyses that even people wouldn't quite agree on, and the fact that you have to impose a particular system, so if it doesn't encode something you wanted, you have yet more work later to fix that.
A recent trend is to put a little more trust in the assumptions of the distributional hypothesis,
e.g. that words in similar context will be comparable.
This helps in that we can now use non-annotated data. We need a lot more of it to reach quality comparable to a much smaller amount of annotated data, yes, but people have collectively produced a lot of text, on the internet and elsewhere.
Even this "use the distributional hypothesis" angle isn't very specific, and has been done in very varied ways over the years.
A short history might be interesting, but for now let's point out that a recent technique is word2vec, which doesn't do a lot more than looking at what appears in similar contexts. (Its view is surprisingly narrow - what happens in a small window over a lot of data will tend to be more consistent. A larger window is not only more work but often too fuzzy.) Its math is apparently fairly similar to classical matrix factorization.
Word embeddings often refers to learning vectors from context,
though there are more varied meanings (some conflicting),
so you may wish to read 'embeddings' as 'text vectors' and figure out for yourself what the implementation actually is.
Static embeddings
Contextual word embeddings
Subword embeddings
The hashing trick (also, Bloom embeddings)
Now we have nicer numbers, but how do I use them?
vectors - unsorted
Moderately specific ideas and calculations
Collocations
Collocations are statistically idiosyncratic sequences.
The math that is often used to find these asks something like "do these adjacent words occur together more often than the individual occurrences of the separate words would suggest?".
...though ideally is a little more refined than that.
This doesn't ascribe any meaning, or comparability,
it just tends to signal anything from
jargon,
various substituted phrases,
empty habitual etiquette,
and many other things that go beyond purely compositional construction,
because why other than common sentence structures would they co-occur so often?
...actually, there are varied takes on how useful collocations are, and why.
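One common take is to score adjacent pairs with pointwise mutual information, PMI = log2( p(a,b) / (p(a) * p(b)) ). A minimal sketch over a toy token list - real use wants a large corpus and a minimum-frequency cutoff:

 import math
 from collections import Counter

 tokens = "kind regards to you and kind regards to them , with kind regards".split()

 unigrams = Counter(tokens)
 bigrams  = Counter(zip(tokens, tokens[1:]))
 n_uni, n_bi = len(tokens), len(tokens) - 1

 def pmi(pair):
     a, b = pair
     p_ab = bigrams[pair] / n_bi
     return math.log2(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

 for pair, count in bigrams.most_common():
     print(pair, count, round(pmi(pair), 2))
 # note that raw PMI favours rare pairs - one reason for a minimum-frequency cutoff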
latent semantic analysis
Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.
random indexing
https://en.wikipedia.org/wiki/Random_indexing
Topic modeling
Roughly the idea: given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.
Assuming such documents sharing topics, you can probably find groups of words that belong to those topics.
Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document - so it acts like a soft/fuzzy clustering.
This is a relatively weak proposal in that it relies on a number of assumptions (the largest probably being that your documents have singular topics), but given that it requires zero training, it works better than you might expect when those assumptions are met.
https://en.wikipedia.org/wiki/Topic_model
word2vec
See *2vec#word2vec