Data modeling, restructuring, and massaging
|This is more for overview of my own than for teaching or exercise.
NLP data massage
Bag of words / bag of features
The bag-of-words model (more broadly bag-of-features model) use the collection of words in a context, unordered, in a multiset, a.k.a. bag.
In other words, we summarize a document (or part of it) it by appearance or count of words, and ignore things like adjacency and order - so any grammar.
In text processing
In introductions to Naive Bayes as used for spam filtering, its naivety essentially is this assumption that feature order does not matter.
Though real-world naive bayes spam filtering would take more complex features than single words (and may re-introduce adjacenct via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words, exc.
Other types of classifiers also make this assumption, or make it easy to do so.
Bag of features
While the idea is best known from text, hence bag-of-words, you can argue for bag of features, applying it to anything you can count, and may be useful even when considered independently.
For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.
In practice, the bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting it fully), and used in a number image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.
The idea that you can describe an image by the collection of small things we recognize in it, and that combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but often easily secondary.
N-grams are contiguous sequence of length n.
They are most often seen in computational linguistics.
Applied to sequences of characters it can be useful e.g. in language identification, but the more common application is to words.
As n-grams models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful to things working off probabilities of combinations of words, for example for statistical parsing, collocation analysis, text classification, sliding window methods (e.g. sliding window POS tagger), (statistical) machine translation, and more
For example, for the already-tokenized input This is a sentence . the 2-grams would be:
- This is
- is a
- a sentence
- sentence .
...though depending on how special you do or do not want to treat the edges, people might fake some empty tokens at the edge, or some special start/end tokens.
|This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
An extension of n-grams where components need not be consecutive (though typically stay ordered).
A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.
They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.
They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
Skip-grams apparently come from speech analysis, processing phonemes.
In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.
Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis.