Data modeling, restructuring, and massaging
The idea is that for each word (or similar unit) you can somehow figure out a dense, semantically useful vector.
: ''Dense'' compared to the input: when you see text as 'one of ~100K possible words', a vector of maybe two hundred dimensions that packs enough meaning for good (dis)similarity comparisons is pretty compact
: ''Semantically useful'' in that those properties tend to be useful
:: often focusing on such (dis)similarity comparisons - e.g. 'bat' may get a sense of animal, tool, verb, and noun -- or rather, in this space appear closer to animals and tools and verbs than e.g. 'banana' does.
Yes, this training is a bunch of up-front work,
but assuming you can learn it well for a target domain (and trying to learn from ''all'' data often gives good ''basic'' coverage of most domains),
then many things built on top have a good basis (and do not have to deal with classical training issues like high dimensionality, sparsity, smoothing, etc.).

There are varied ways to do this.
One way might be to use well-annotated data, but that is costly to come by.
A recent trend is to use non-annotated data, and a lot more of it.
Another is to trust the assumptions the [[distributional hypothesis]] makes,
e.g. that words in similar contexts will be comparable,
and focus on words in context.
This was done in varied ways over the years (e.g. [[Latent Semantic Analysis]] applies somewhat),
later with more complex math, and/or neural nets,
which may just train better - the way you handle the output is much the same.

One of the techniques that kicked this off in more recent years is [[word2vec]],
which doesn't do a lot more than look at what appears in similar contexts.
(Its view is surprisingly narrow - what happens in a small window in a ''lot'' of data will tend to be more consistent.
A larger window is not only more work, but often too fuzzy.)
Its math is apparently {{search|Levy "Neural Word Embedding as Implicit Matrix Factorization"|fairly similar to classical matrix factorization}}.

'''Word embeddings''' often refers to learning vectors from context,
though there are more varied meanings (some conflicting),
so you may wish to read 'embeddings' as 'text vectors' and figure out for yourself what the implementation actually is.
---

In the context of some of the later developments, the simpler variant implementations are considered static vectors.

'''Static vectors''' refers to systems where each word always gets the same vector.
That usually means:
* make a vocabulary: each word gets an entry
* learn a vector for each item in the vocabulary

---

Using embeddings (a small code sketch follows this list):
* use the vectors as-is
* adapt the embeddings with your own training
:: starts with a good basis, refines it for your use
:: but: only deals with tokens already in there
* there are also some ways to selectively alter vectors
:: can be useful if you want to keep sharing the underlying embeddings
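As a rough sketch of the 'use the vectors as-is' case: assuming you already have static vectors loaded as a plain word-to-array mapping (the loading and the variable names here are made up), (dis)similarity is often just cosine similarity between those vectors:

<syntaxhighlight lang="python">
import numpy as np

def cosine_similarity(a, b):
    # how similarly two vectors point; 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, vectors, topn=5):
    # brute-force comparison against every other vocabulary entry;
    # 'vectors' is assumed to be a dict mapping word -> numpy array,
    # loaded from whatever pretrained static embeddings you use
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
</syntaxhighlight>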
---

We previously mentioned that putting a single number on a word has issues.
We now point out that putting a single vector on a word has some issues too.

Consider, say, "we saw the saw".
With a static vector method, this still cannot be expressed in numbers without those two words ''having'' to be the same thing.
In "we saw a bunny" or "I sharpened the saw", 'saw' will have that one sense - probably both the tool quality and the seeing quality -
and any use will tell you it's slightly about tools.
But saw, hm.
And if you table a motion, it ''will'' associate with gestures and woodworking,
because those are the more common things.
All of this may still apply a single vector to the same word always (sometimes called static word embeddings).
This is great for unambiguous content words, but less so for polysemy and homonymy.
---
This is more of an overview for my own use than for teaching or exercises.
Intro
NLP data massage / putting meanings or numbers to words
Bag of words / bag of features
The bag-of-words model (more broadly the bag-of-features model) uses the collection of words in a context, unordered, in a multiset, a.k.a. bag.
In other words, we summarize a document (or part of it) by the appearance or count of words, and ignore things like adjacency and order - and so any grammar.
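As a minimal sketch of that summarizing, in plain Python (real tokenization would be more careful than lowercasing and splitting on whitespace):

<syntaxhighlight lang="python">
from collections import Counter

def bag_of_words(text):
    # ignore order entirely, just count word occurrences
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
</syntaxhighlight>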
In text processing
In introductions to Naive Bayes as used for spam filtering, its naivety essentially is this assumption that feature order does not matter.
Though real-world naive bayes spam filtering would take more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words.
Other types of classifiers also make this assumption, or make it easy to do so.
Bag of features
While the idea is best known from text, hence bag-of-words, you can argue for a bag of features more generally, applying it to anything you can count, which may be useful even when the features are considered independently.
For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.
In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.
The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.
See also:
N-gram notes
N-grams are contiguous sequences of length n.
They are most often seen in computational linguistics.
Applied to sequences of characters they can be useful e.g. in language identification,
but the more common application is to words.
As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.
For example, for the already-tokenized input This is a sentence . the 2-grams would be:
- This is
- is a
- a sentence
- sentence .
...though depending on how specially you do or do not want to treat the edges, people might fake some empty tokens at the edges, or add special start/end tokens.
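A minimal sketch of generating those (no special handling of the edges):

<syntaxhighlight lang="python">
def ngrams(tokens, n):
    # all contiguous runs of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["This", "is", "a", "sentence", "."], 2))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
</syntaxhighlight>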
Skip-grams
Note: Skip-grams now seem to refer to two different things.
An extension of n-grams where components need not be consecutive (though typically stay ordered).
A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.
They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.
They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
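A sketch of the bigram (n=2) case, under one reasonable reading of that definition - pair each word with every word that follows it within at most k skipped positions:

<syntaxhighlight lang="python">
def skip_bigrams(tokens, k):
    # pairs (w1, w2) where w2 follows w1, skipping at most k intervening words;
    # k=0 gives plain bigrams
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

print(skip_bigrams(["we", "saw", "a", "bunny"], 1))
# [('we', 'saw'), ('we', 'a'), ('saw', 'a'), ('saw', 'bunny'), ('a', 'bunny')]
</syntaxhighlight>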
Skip-grams apparently come from speech analysis, processing phonemes.
In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.
Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis.
Syntactic n-grams
Flexgrams
Words as features - one-hot coding and such
Putting numbers to words
Computers and people and numbers
vector space representations, word embeddings, and more
Contextual word embeddings
Subword embeddings
Bloom embeddings, a.k.a. the hash trick
Moderately specific ideas and calculations
Collocations
Collocations are statistically idiosyncratic sequences - the math that is often used asks "do these adjacent words occur together more often than the occurrence of each individually would suggest?".
This doesn't ascribe any meaning; it just tends to signal anything from empty habitual etiquette, to jargon, to various substituted phrases, and many other things that go beyond purely compositional construction - because why, other than common sentence structures, would they co-occur so often?
...actually, there are varied takes on how useful collocations are, and why.
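One common way to make that "more often than each individually would suggest" question concrete is (pointwise) mutual information over adjacent pairs - a rough sketch:

<syntaxhighlight lang="python">
import math
from collections import Counter

def pmi_collocations(tokens, min_count=5):
    # score each adjacent pair by how much more often it occurs
    # than the frequencies of its two words alone would suggest
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:   # rare pairs give unreliable scores
            continue
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
</syntaxhighlight>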
latent semantic analysis
Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.
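A minimal sketch using scikit-learn, reducing a document-term count matrix with truncated SVD (LSA more typically uses tf-idf weighting; the toy documents are made up):

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
]
counts = CountVectorizer().fit_transform(docs)   # sparse document-term matrix
lsa = TruncatedSVD(n_components=2)               # keep 2 'latent' dimensions
doc_vectors = lsa.fit_transform(counts)          # each document as a dense 2-d vector
print(doc_vectors.shape)                         # (3, 2)
</syntaxhighlight>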
random indexing
https://en.wikipedia.org/wiki/Random_indexing
Topic modeling
Roughly the idea: given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.
Assuming such documents share topics, you can probably find groups of words that belong to those topics.
Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document, so this acts like a soft/fuzzy clustering.
This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires zero training, it works better than you might expect when those assumptions are met (the largest probably being that your documents each have a single topic).
https://en.wikipedia.org/wiki/Topic_model
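A minimal sketch of the above with scikit-learn's LatentDirichletAllocation (one common topic model); the toy documents are made up and far too few to learn meaningful topics, this just shows the shape of the approach:

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the patient was given antibiotics for the infection",
    "the doctor prescribed medication for the patient",
    "the team scored in the final minutes of the match",
    "the match went to a penalty shootout",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixture (the soft assignment)

# show the top words for each learned topic
words = vectorizer.get_feature_names_out()
for topic_id, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(topic_id, top_words)
</syntaxhighlight>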
word2vec
word2vec is one of many ways to put semantic vectors to words (in the distributional hypothesis approach), and refers to two techniques, using either bag-of-words or skip-gram as processing for a specific learner, as described in T Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space", probably the paper that kicked off wider interest in this dense-vector idea.
Word2vec amounts to building a classifier that predicts what word appears in a context, and/or what context appears around a word,
which happens to do a decent job of classifying that word.
That paper mentions
- its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow(verify))
- its continuous skip-gram variant predicts surrounding words given the current word.
- Uses skip-grams as a concept/building block. Some people refer to this technique as just 'skip-gram' without the 'continuous',
but this may come from not really reading the paper you're copy-pasting the image from?
- seems to be better at less-common words, but slower
(NN implies one-hot coding, so not small, but it turns out to be moderately efficient(verify))
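For reference, a minimal sketch of training such vectors with the gensim library (parameter names as in gensim 4.x; the toy corpus is made up and far too small to learn anything useful):

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

sentences = [
    ["we", "saw", "the", "saw"],
    ["i", "sharpened", "the", "saw"],
    ["we", "saw", "a", "bunny"],
]
model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the learned vectors
    window=2,         # context window size
    min_count=1,      # keep even rare words (only sensible on a toy corpus)
    sg=1,             # 1 = (continuous) skip-gram, 0 = CBOW
)
print(model.wv["saw"].shape)          # (50,)
print(model.wv.most_similar("saw"))   # nearest words by cosine similarity
</syntaxhighlight>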