Data modeling, restructuring, and massaging
====Computers and people and numbers====

Where people are good at words and bad at numbers, computers are good at numbers and bad at words.

So it makes sense to express words ''as'' numbers?

That does go a moderate way, but it really matters ''how''.
====vector space representations, word embeddings, and more====
Around text, '''vector space representations''' are the general idea
that for each word (or similar unit) you can somehow figure out a dense, semantically useful vector.
: ''Dense'' in the sense that you could see the input as having as many dimensions as there are distinct words (tens of thousands at least), and compared to that, we manage to push most of the sense of our words into something on the order of maybe two hundred dimensions. This seems to pack in enough meaning to have the result compare different kinds of everyday properties.
: ''Semantically useful'' in that those properties tend to be useful
:: e.g. 'bat' may get a sense of animal, tool, verb, and noun -- or rather, in this space it appears closer to animals and tools and verbs than e.g. 'banana' does.
Recently, these are often trained by quantity - a ''lot'' of text - rather than a smaller set of well-annotated data.
The idea (see also the [[distributional hypothesis]]) is that looking at words in similar contexts,
and training from nearby words, is enough to give a good sense of comparability.
Years ago that was done with linear algebra (see e.g. [[Latent Semantic Analysis]]),
now it is done with more complex math and/or neural nets, which is more polished,
but the way you handle the output is much the same.

Yes, this training is an extra job you need to do up front, but assuming you can learn it well,
the real learner now doesn't have to deal with ridiculous dimensionality,
or the sparsity/smoothing issues that often brings in.
'''word embeddings'''

We previously mentioned that putting a single number on a word has issues.
We now point out that putting a single vector on a word has some issues too.
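To make the general idea concrete before getting to those issues: a minimal sketch of poking at one pretrained vector set, here via gensim's downloader (the model name is just one commonly redistributed option, and the data is downloaded on first use):

<syntaxhighlight lang="python">
# Load a pretrained word-vector set and compare words by cosine similarity.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-100")   # ~100-dimensional dense vectors

print(vectors["bat"].shape)                  # (100,) - one dense vector per word
print(vectors.similarity("bat", "ball"))     # cosine similarity; higher means more related
print(vectors.most_similar("bat", topn=5))   # nearest words in this space
</syntaxhighlight>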
=====Subword embeddings=====

A model where a word/token can be characterized by something ''smaller'' than that exact whole word:
a technique that assigns meanings to words
via meanings learned on subwords - which can be arbitrary fragments.

This can also do quite well at things that are otherwise [[out of vocabulary]].
Say, the probably-out-of-vocabulary ''apploid'' may get a decent guess
if we learned a vector for ''appl'' from e.g. ''apple''.
It also starts dealing with misspellings a lot better.

Understanding the language's morphology would probably do a little better,
but just sharing larger fragments of characters tends to do well enough,
in part because inflection, compositional agglutination (e.g. Turkish),
and such are often ''largely'' regular.
Yes, this is sort of an [[n-gram]] trick,
and for that reason the data (which you ''do'' have to load to use)
can quickly explode,
which is why it's often combined with [[bloom embeddings]].
Examples:
fastText, floret
[https://d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html]
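As a small illustration, a sketch using gensim's FastText implementation (the toy corpus, dimensions, and n-gram lengths are just for demonstration):

<syntaxhighlight lang="python">
# Subword embeddings: vectors are built from character n-grams,
# so even unseen words get a (rough) vector instead of an out-of-vocabulary error.
from gensim.models import FastText

corpus = [
    ["i", "ate", "an", "apple"],
    ["the", "apple", "tree", "blossomed"],
    ["bananas", "are", "yellow"],
]

# min_n/max_n are the character n-gram lengths the subword vectors are learned on
model = FastText(sentences=corpus, vector_size=50, min_count=1, min_n=3, max_n=5, epochs=50)

# 'apploid' never occurs in the corpus, but shares n-grams like 'app' and 'ppl' with 'apple'
print(model.wv["apploid"][:5])
print(model.wv.similarity("apploid", "apple"))
</syntaxhighlight>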
=====Bloom embeddings, a.k.a. the hash trick=====
An idea akin to [[bloom filter]]s applied to word embeddings.

Consider a language model that should be assigning vectors.
When it sees [[out-of-vocabulary]] words, what do you do?

Do you treat them as not existing at all?
: ideally, we could do something quick and dirty that is better than nothing.

Do you add just as many entries to the vocabulary?
That can be large, and more importantly, documents now don't share vocabs anymore, or the vectors indexed by those vocabs.

Do you map them all to a single 'unknown' vector?
That's small, but makes them ''by definition'' indistinguishable.
If you wanted to do even a ''little'' extra contextual learning for them,
that learning would probably want to move that one vector in all directions, and end up doing nothing and being pointless.
Another option (sketched in code after this list) would be to
* reserve a number of entries for unknown words,
* assign all unknowns into those entries somehow (in a way that ''will'' still collide)
:: via some hash trickery
* hope that the words that get assigned together aren't ''quite'' as conflicting as ''all at once''.
It's a ''very'' rough approach, but:
* it is definitely better than nothing.
* with a little forethought you could sort of share these vectors between documents
:: in that the same words will map to the same entry every time
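A minimal sketch of that idea, with made-up sizes and a deliberately simple hash scheme - this is just the general shape, not how floret or thinc implement it:

<syntaxhighlight lang="python">
# Hash-trick vectors for unknown words: a small fixed table, indexed by a couple of
# differently-seeded hashes, so collisions are spread out rather than all-in-one-bucket.
import zlib
import numpy as np

rows, dim, seeds = 1024, 64, ("a", "b")                 # small table, two hash seeds
table = np.random.default_rng(0).normal(size=(rows, dim)).astype(np.float32)

def unknown_word_vector(word: str) -> np.ndarray:
    # sum a few rows picked by stable hashes; the same word always maps
    # to the same rows, so separately processed documents still agree
    idxs = [zlib.crc32(f"{seed}:{word}".encode()) % rows for seed in seeds]
    return table[idxs].sum(axis=0)

assert np.allclose(unknown_word_vector("apploid"), unknown_word_vector("apploid"))
</syntaxhighlight>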
Limitations:
* when you smush things together and use what you previously learned, you get a mix of possibly unrelated meanings
* when you smush things together and ''learn'', you relate unrelated things
This sort of bloom-like intermediate is also applied to subword embeddings,
because it gives a ''sliding scale'' between
'so large that it probably won't fit in RAM' and 'so smushed together it has become too fuzzy'.

Some implementations:
* [https://github.com/explosion/floret floret] (bloom embeddings for fastText)
* thinc's HashEmbed [https://thinc.ai/docs/api-layers#hashembed]
* spaCy's MultiHashEmbed and HashEmbedCNN (which use thinc's HashEmbed)
https://spacy.io/usage/v3-2#vectors
This is more an overview for my own use than for teaching or exercise.
Intro
NLP data massage / putting meanings or numbers to words
Bag of words / bag of features
The bag-of-words model (more broadly, the bag-of-features model) uses the collection of words in a context, unordered, in a multiset, a.k.a. bag.
In other words, we summarize a document (or part of it) by the appearance or count of words, and ignore things like adjacency and order - and so any grammar.
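A minimal sketch of that reduction - a document becomes nothing but word counts:

<syntaxhighlight lang="python">
# Reduce a document to a bag (multiset) of words: counts only, no order, no grammar.
from collections import Counter

doc = "the cat sat on the mat because the mat was warm"
bag = Counter(doc.split())
print(bag)   # Counter({'the': 3, 'mat': 2, 'cat': 1, 'sat': 1, 'on': 1, 'because': 1, 'was': 1, 'warm': 1})
</syntaxhighlight>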
In text processing
In introductions to Naive Bayes as used for spam filtering, its naivety is essentially this assumption that feature order does not matter.
Though real-world naive bayes spam filtering would take more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words.
Other types of classifiers also make this assumption, or make it easy to do so.
Bag of features
While the idea is best known from text, hence bag-of-words, you can argue for a more general bag-of-features, applying it to anything you can count, which may be useful even when the features are considered independently.
For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.
In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.
The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.
See also:
N-gram notes
N-grams are contiguous sequences of length n.
They are most often seen in computational linguistics.
Applied to sequences of characters they can be useful e.g. in language identification,
but the more common application is to words.
As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding-window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.
For example, for the already-tokenized input This is a sentence . the 2-grams would be:
- This is
- is a
- a sentence
- sentence .
...though depending on how specially you do or do not want to treat the edges, people might fake some empty tokens at the edges, or add special start/end tokens.
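A minimal sketch of extracting such n-grams from already-tokenized input:

<syntaxhighlight lang="python">
# Word n-grams from a token list: every contiguous run of n tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["This", "is", "a", "sentence", "."], 2))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
</syntaxhighlight>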
Skip-grams
Note: skip-grams now seem to refer to two different things.
An extension of n-grams where components need not be consecutive (though typically stay ordered).
A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.
They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.
They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
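A minimal sketch of the extension-of-n-grams sense, here for k-skip-bigrams (reading "at most k" as "at most k skipped tokens between the two members"):

<syntaxhighlight lang="python">
# k-skip-bigrams: ordered pairs of tokens with at most k tokens skipped between them.
def skip_bigrams(tokens, k):
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):   # j - i - 1 tokens skipped
            pairs.append((left, tokens[j]))
    return pairs

print(skip_bigrams(["This", "is", "a", "sentence", "."], k=1))
# [('This', 'is'), ('This', 'a'), ('is', 'a'), ('is', 'sentence'),
#  ('a', 'sentence'), ('a', '.'), ('sentence', '.')]
</syntaxhighlight>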
Skip-grams apparently come from speech analysis, processing phonemes.
In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.
Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis (word2vec's continuous skip-gram, mentioned below).
Syntactic n-grams
Flexgrams
Words as features - one-hot coding and such
Putting numbers to words
Computers and people and numbers
vector space representations, word embeddings, and more
Contextual word embeddings
Subword embeddings
Bloom embeddings, a.k.a. the hash trick
Moderately specific ideas and calculations
Collocations
Collocations are statistically idiosyncratic sequences - the math that is often used asks "do these adjacent words occur together more often than the occurrence of each individually would suggest?".
This doesn't ascribe any meaning; it just tends to signal anything from empty habitual etiquette, jargon, and various substituted phrases to many other things that go beyond purely compositional construction - because why, other than common sentence structure, would they co-occur so often?
...actually, there are varied takes on how useful collocations are, and why.
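The underlying measurement is easy to play with; a minimal sketch using NLTK's collocation utilities with PMI as the association measure (any sizeable tokenized corpus will do, the genesis corpus is just a convenient downloadable one):

<syntaxhighlight lang="python">
# Rank adjacent word pairs by how much more often they co-occur than chance would suggest.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("genesis")
words = nltk.corpus.genesis.words("english-web.txt")

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                     # very rare pairs inflate PMI, so drop them
print(finder.nbest(bigram_measures.pmi, 10))    # ten most idiosyncratic adjacent pairs
</syntaxhighlight>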
latent semantic analysis
Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.
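A minimal sketch of the mechanics, here as truncated SVD of a tf-idf matrix via scikit-learn (toy documents and a toy number of kept dimensions):

<syntaxhighlight lang="python">
# LSA: factor a term-document matrix and keep only a few latent dimensions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

X = TfidfVectorizer().fit_transform(docs)   # documents x terms, sparse
lsa = TruncatedSVD(n_components=2)          # keep 2 latent dimensions
doc_vectors = lsa.fit_transform(X)          # dense, low-dimensional document representations
print(doc_vectors.shape)                    # (4, 2)
</syntaxhighlight>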
random indexing
https://en.wikipedia.org/wiki/Random_indexing
Topic modeling
Roughly the idea that, given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.
Assuming a set of such documents sharing topics, you can probably find groups of words that belong to those topics.
Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of those topics to each document - so this acts like a soft/fuzzy clustering.
This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires zero training, it works better than you might expect when those assumptions are met (the largest probably being that your documents each have a single topic).
https://en.wikipedia.org/wiki/Topic_model
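A minimal sketch using gensim's LDA implementation (toy documents and topic count; real use wants far more text):

<syntaxhighlight lang="python">
# Topic modeling: find groups of co-occurring words, then softly assign documents to them.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    ["cat", "dog", "pet", "food"],
    ["dog", "leash", "walk", "pet"],
    ["stock", "market", "shares", "fell"],
    ["investors", "market", "stock", "rose"],
]

dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]           # bag-of-words per document

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)
print(lda.print_topics())       # the word mixture behind each topic
print(lda[bow_corpus[0]])       # soft assignment of document 0 over the topics
</syntaxhighlight>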
word2vec
word2vec is one of many ways to put semantic vectors to words (in the distributional hypothesis approach), and refers to two techniques, using either bag-of-words or skip-grams as processing for a specific learner, as described in T. Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space"
That paper mentions
- its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow(verify))
- its continuous skip-gram variant predicts surrounding words given the current word.
- Uses skip-grams as a concept/building block. Some people refer to this technique as just 'skip-gram' without the 'continuous',
but this may come from not really reading the paper you're copy-pasting the image from?
- seems to be better at less-common words, but slower
The way each variant builds that prediction happens to make it a decent characterization of the word, so actually both work out as characterizing words.
A neural net implies one-hot coding of the input, so that input is not small, but it turns out to be moderately efficient(verify).
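A minimal sketch of training it with gensim's implementation (toy corpus; sg=1 selects the continuous skip-gram variant, sg=0 the CBOW variant):

<syntaxhighlight lang="python">
# word2vec via gensim: learn dense word vectors from nearby-word prediction.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["cat"][:5])                   # first few dimensions of the learned vector
print(model.wv.most_similar("cat", topn=3))  # nearest words in the learned space
</syntaxhighlight>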