Data modeling, restructuring, and massaging

{{#addbodyclass:tag_math}}
{{Math notes}}


=Putting numbers to words=


====Computers and people and numbers====
<!--
Where people are good at words and bad at numbers, computers are good at numbers and bad at words.
So it makes sense to express words ''as'' numbers?


That does go a moderate way, but it really matters ''how''.




If you just make a list enumerating, say, 1=the, 2=banana, 3=is, ..., 150343=ubiquitous
: it makes data ''smaller'' to work with, but equally clunky to do anything with other than counting {{comment|(...which has its uses - e.g. gensim's collocation analysis does such an enumeration to keep its intermediate calculations all-numeric)}}.
: but does absolutely nothing to make them ''comparable''
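A minimal sketch of that plain enumeration; gensim's corpora.Dictionary does essentially this, but a plain dict keeps the sketch dependency-free:

```python
# Each distinct word gets an integer id, nothing more.
token2id = {}
for tok in "the banana is the banana".split():
    token2id.setdefault(tok, len(token2id))

assert token2id == {"the": 0, "banana": 1, "is": 2}

# The text becomes compact ids - easy to count with,
# but the ids themselves say nothing about meaning or similarity.
ids = [token2id[t] for t in "the banana is the banana".split()]
assert ids == [0, 1, 2, 0, 1]
```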




In a lot of cases it would be really nice if we could encode ''some'' amount of meaning.


But since that asks multiple difficult questions at once,
like 'how would you encode different kinds of things in a way that is useful later',
we are initially happy with the ability to say that 'is' is more like 'be' than like 'banana' -
or at least to have a metric of similarity, or of what kind of role a word plays.


There are many approaches to this you can think of,
and almost all you can think of have been tried, to moderate success.


It can help to add a knowledge base of sorts
: Maybe start with an expert-of-sorts making a list of all terms you know, and recognizing them in a text.
: Maybe go as far as modeling that swimming is a type of locomotion, and so is walking.
This is a large amount of work for something that will never be complete.
It ''can'' be accurate at the thing it does, sometimes more so than the things about to be mentioned,
but having it perform well still takes a lot of care.

It can help to collect statistics about what kind of sentence patterns there are,
and what kinds of words are usually where.

It can help to detect inflections - like if a word ends in -ness then it's probably a [[noun]] and maybe [[uncountable]],
and you can tag that even if you don't know the root word/morpheme it's on.

Each of these may take a large amount of work, and each can do a single task with good accuracy,
though it may only be good at what it does because it doesn't even try other things.

It can help to combine methods, and some of the systems built like this perform quite decently.
 


-->
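A toy version of the suffix heuristic mentioned above: tag a word's likely part of speech from its ending, even when the root is unknown. The suffix list here is made up and far from complete - it only illustrates the idea:

```python
# Guess a part-of-speech tag from word endings alone.
SUFFIX_TAGS = [
    ("ness", "noun"),
    ("tion", "noun"),
    ("ize",  "verb"),
    ("ly",   "adverb"),
]

def guess_pos(word):
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return "unknown"

assert guess_pos("fluffiness") == "noun"   # root may be unknown; the ending is enough
assert guess_pos("blorfly") == "adverb"    # works on made-up words too
assert guess_pos("banana") == "unknown"
```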
====Semantic similarity====


<!--
Semantic similarity is the broad area of [[metric]]s between words, phrases, and/or documents.

In some cases, semantic similarity can refer to any metric that gives you distances between those,
based on any sort of estimation of the likeness.


Some people use the term '''semantic similarity''' to focus on the ontological sort of methods - and possibly specifically "is a" and other [[ontology|ontological]] relationships, and ''not'' just "seems to co-occur". {{comment|(and in that context, a distinction is sometimes made to '''semantic relatedness''', which widens from is-a to also include [[antonyms]] (opposites), [[meronyms]] (part of whole), [[hyponyms]]/[[hypernyms]])}}

Fuzzier methods are also common, in part because they can be trained to give ''some'' answer for everything,
and even if their answers are not high quality, they can be consistent.

Say, you can get a reasonable answer to basic (dis)similarity between all words without much up-front annotation or knowledge,
and for basically all words you can find a little context for to feed into the system.




-->


<!--
This can include
* words that appear in similar context (see also [[word embeddings]], even [[topic modelling]])
: Say, where many methods may put 'car', 'road', and 'driving' in the same area, this may also say roughly ''how'' they are related, in a semantic sense. This can be more precise distance-wise, except that it is also more easily incomplete.


-->
<!--


https://en.wikipedia.org/wiki/Semantic_similarity
-->
====vector space representations, word embeddings, and more====
<!--
Around text, '''vector space representations''' are the general idea that for each word (or similar unit) you
can calculate something that you can meaningfully compare to others.
-->
=====Just count in a big table=====
<!--
'''You could just count.'''[https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html] Given a collection of documents
* decide what words to include
* make a big documents-by-words table {{comment|(we like to say 'matrix' instead of 'table' when we get mathy, but it's the same concept really)}}
* count

One limitation is that including all words makes a humongous table {{comment|(and most cells will contain zero)}},
yet not including a word means saying it does not exist ''at all''.

Another limitation is that counts ''as such'' are not directly usable, for dumb reasons like that
if more count meant more important, longer documents would have higher counts just because they are longer,
and 'the' and 'a' would be the most important words.

'''You could use something like [[tf-idf]]'''[https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer], an extra step on top of the previous, which e.g.
* will downweigh 'the' by merit of it being absolutely everywhere
* reduces the effect of document length

Another limitation is that the table's other axis being 'all documents' seems huge for no good reason.
There are plenty of tasks where those original training documents (that happened to be a training task long ago) are just... not useful.
Sure there is information in those documents, but can't we learn from it, then put it in a more compact form?

Another limitation is that these still focus on unique words. Even words that are inflections of others will be treated as entirely independent,
so a document that uses only 'fish' and 'spoon' might even be treated as entirely distinct from one that says only 'fishing' and 'spooning'.
It's never quite that bad, but you can intuit why this isn't useful.

One thing many have tried is matrix methods - factorization, dimensionality reduction, and the likes.
You don't need to know how the math works, but the point is that similarly expressed terms can be recognized as such,
and e.g. 'fish' and 'fishing' will smush into each other, automatically, because otherwise-similar documents will mention both
{{comment|(looking from the perspective of the other axis, maybe we can also compare documents better - but unless comparing documents was your goal, that was actually something we were trying to get rid of)}}
-->
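A short sketch of the counting and tf-idf steps above, using the scikit-learn classes linked in the text; the two documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# documents-by-words table of raw counts (sparse; mostly zeros on real data)
cv = CountVectorizer()
counts = cv.fit_transform(docs)
print(sorted(cv.vocabulary_))  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

# tf-idf reweighs that same table: 'the' appears in every document,
# so its idf is minimal and its weight drops relative to rarer words
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
row = weights.toarray()[0]
vocab = tfidf.vocabulary_
assert row[vocab["cat"]] > 0
# 'the' occurs twice in this document yet is not weighted twice as heavily:
assert row[vocab["the"]] < 2 * row[vocab["cat"]]
```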


=====Word embeddings=====
 
<!--
The above arrived in an area where those vectors no longer represent just the words,
but contain some comparability to similar words.

There are some older distributional similarity approaches - these were a little clunky in that
they made for high-dimensional, sparse vectors, where each one represented a specific context.
They were sometimes more explainable, but somewhat unwieldy.

'''"Word embeddings"''' often refers to the word vector thing ''but more compactly'' somehow:
we try to somehow figure out a dense, semantically useful vector.
: ''Dense'' compared to the input: when you can see text as 'one of ~100K possible words', then a vector of maybe two hundred dimensions that seems to do a good job packing in enough meaning for good (dis)similarity comparisons is pretty good
: ''Semantically useful'' in that those properties tend to be useful
:: often focusing on such (dis)similarity comparisons - e.g. 'bat' may get a sense of animal, tool, verb, and noun -- or rather, in this space appear closer to animals and tools and verbs than e.g. 'banana' does.

'Word embeddings' also has more varied meanings (some conflicting),
so you may wish to read 'embeddings' as 'text vectors' and figure out yourself what the implementation actually is.

This amounts to even more up-front work, so why do this dense semantic thing?

One reason is practical - classical methods run into some age-old machine learning problems like high dimensionality and sparsity,
and there happen to be some word embedding methods that sidestep that (by cheating, but by cheating ''pretty well'').

Also, putting all words in a single space lets us compare terms, sentences, and documents.
If the vectors are good, this is a good approximation of ''semantic'' similarity.

If we could agree on a basis between uses (e.g. build a reasonable set of vectors per natural language)
we might even be able to give a basic idea of e.g. what a document is about.
...that one turns out to be optimistic, for a slew of reasons.
You can often get better results out of becoming a little domain-specific.
Which then makes your vectors specific to just your system again.

'''So how do you get these dense semantic vectors?'''

There are varied ways to do this.

You ''could'' start with well-annotated data, and that might be of higher quality in the end,
but it is hard to come by annotation for many aspects over an entire language.
It's a lot of work to even try - and that's still ignoring details like contextual ambiguity,
analyses even people wouldn't quite agree on, and the fact that you have to impose a particular system,
so if it doesn't encode something you wanted, you have to do it ''again'' later.

A more recent trend is to put a little more trust in the assumptions of the [[distributional hypothesis]],
e.g. that words in similar context will be comparable,
and focus on words in context.
For that we can use non-annotated data. We need a ''lot'' more of it for comparable quality,
but people have collectively produced a lot of text, internet and other.

This was done in varied ways over the years (e.g. [[Latent Semantic Analysis]] applies somewhat),
later with more complex math, and/or neural nets.
Which may just train better - the way you handle the output is much the same.

One of the techniques that kicked this off in more recent years is [[word2vec]],
which doesn't do a lot more than looking at what appears in similar contexts.
(Its view is surprisingly narrow - what happens in a small window in a ''lot'' of data will tend to be more consistent.
A larger window is not only more work but often too fuzzy.)
Its math is apparently {{search|Levy "Neural Word Embedding as Implicit Matrix Factorization"|fairly like the classical matrix factorization}}.

It's an extra job you need to do up front, but assuming you can learn it well,
the real learner now doesn't have to deal with ridiculous dimensionality,
or the sparsity/smoothing that it often brings in.
-->
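Once words live in a shared vector space, comparison is usually cosine similarity. The tiny 4-dimensional vectors here are made up purely to show the comparison step - real embeddings are learned, and a few hundred dimensions wide:

```python
import numpy as np

# Toy "embeddings"; the values are invented for illustration only.
emb = {
    "cat":    np.array([0.9, 0.1, 0.0, 0.3]),
    "dog":    np.array([0.8, 0.2, 0.1, 0.4]),
    "banana": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 'cat' should land closer to 'dog' than to 'banana'
assert cosine(emb["cat"], emb["dog"]) > cosine(emb["cat"], emb["banana"])
```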


=====Static embeddings=====
<!--
'''Static vectors''' refer to systems where each word always gets the same vector.
That usually means:
* make vocabulary: each word gets an entry
* learn vector for each item in the vocabulary
'''Limits of static embeddings'''
We previously mentioned that putting a single number on a word has issues.

We now point out that putting a single ''vector'' on a word has some issues too.

Yes, a lot of words will get a completely sensible sense.

Yet if we made an enumerated vocabulary, and each item gets a vector, then the saw in "I sharpened the saw" and "we saw a bat" ''will'' be assigned the same vector; same for the bat in "I saw a bat" and "I'll bat an eyelash".

The problem is that ''if'' that vector is treated as the semantic sense, both saws have exactly the same sense
- probably both the tool quality and the seeing quality -
and any use will tell you it's slightly about tools
{{comment|(by count, most 'saw's are the seeing, something that even unsupervised learning should tell you)}}.


You can imagine it's not really an issue for words like antidisestablishmentarianism.


But saw, hm.




But for similar reasons, if you {{example|table a motion}}, it ''will'' drag in gestures and woodworking, because those are the more common senses.

'''"Can't you just have multiple saws, encode 'saw as a verb' differently from 'saw as an implement'?"'''

Sure, now you can ''store'' that, but how do you decide?


Say, we give ''spoiaued'' to an embedding-style parser, and it's going to guess it's a verb ''while also'' pointing out it's out of its vocabulary.


Note that
* you won't really know what these vectors mean.
: You can fish this out, to a degree, because you can compare vectors.
: say, if any given vector compares much better to 'hammer' or to 'see' (e.g. from examples sentences), you can start to figure out what it meant.
* the thing you train does not necessarily
In ''use'', the assigned vector is typically not dependent on the current context,
but the vector that was learned earlier was dependent on the context in the training data. {{verify}}
(you can call this [[distributional similarity]]: a word is characterized by the company it keeps.)
These vectors come from machine learning (of varying type and complexity).
* LSA
* [[word2vec]] -
: technically patented?
: T Mikolov et al. (2013) "Efficient Estimation of Word Representations in Vector Space"
* tok2vec
* FastText
: https://fasttext.cc/
* lda2vec
: https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=




All of this may still apply a single vector to the same word always (sometimes called static word embeddings).
This is great for unambiguous content words, but less so for polysemy and homonymy.




-->
=====Contextual word embeddings=====
<!--


The first attempts at word embeddings, and many since,




Consider "we saw a bat".
In theory, it may assign
* a vector to the verb saw that is more related to seeing than to cutting
* a vector to the noun bat that is more related to other animals than it is to sports equipment
That said, that is not a given.
: There's at least four options and it might land on any of them, particularly for a sentence in isolation
: the meanings we propose to be isolated here may get weirdly blended in training




-->
=====Subword embeddings=====
<!--

A model where a word/token can be characterized by something ''smaller'' than that exact whole word.


A technique that assigns meanings to words
via meanings learned on subwords - which can be arbitrary fragments.

This can also do quite well at things otherwise [[out of vocabulary]].
Say, the probably-out-of-vocabulary ''apploid'' may get a decent guess
if we learned a vector for ''appl'' from e.g. ''apple''.

Also, it starts dealing with misspellings a lot better.

Understanding the language's morphology would probably do a little better,
but just sharing larger fragments of characters tends to do well enough,
in part because inflection, compositional agglutination (e.g. Turkish),
and such are often ''largely'' regular.

Yes, this is sort of an [[n-gram]] trick,
and the data (which you ''do'' have to load to use)
can quickly explode for that reason.

For this reason it's often combined with [[bloom embeddings]].
 
 




Examples:
fastText, floret


[https://d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html]


-->
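A sketch of fastText-style subword units: break a word into character n-grams (with boundary markers), so that a word's vector can be built as the sum of its n-gram vectors. The made-up ''apploid'' then shares units with ''apple'', which is why out-of-vocabulary words and misspellings still get a decent guess:

```python
# Character n-grams of a word, with < and > marking the word boundaries,
# roughly as fastText does (fastText uses a range of n; one n suffices here).
def subwords(word, n=4):
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

# 'apploid' is out of vocabulary, yet overlaps with 'apple' in its subwords
shared = subwords("apploid") & subwords("apple")
assert "<app" in shared and "appl" in shared
```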


=====The hashing trick (also, Bloom embeddings)=====
<!--


The '''hashing trick''' works for everything from basic counting to contextual and sub-word embeddings -- just anywhere you need to put a fixed bound on the vocabulary, and are willing to accept degrading performance beyond that.

'''You can use the [[hashing trick]]'''[https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer] if you want to put an upper bound on memory -- but this comes at an immediate cost that may be avoided with a more clever plan.
* squeezes all words into a fixed amount of entries
* ...almost indiscriminately, so the more words you push into fewer entries, the less accurate it is.
* it's better than nothing given low amounts of memory, but there are usually better options

The hashing trick applied to word embeddings is sometimes called '''Bloom embeddings''', because it's an idea ''akin'' to [[bloom filter]]s.
But probably more because that's shorter than "embeddings with the hashing trick".

Consider a language model that should be assigning vectors.
When it sees [[out-of-vocabulary]] words, what do you do?

Do you treat them as not existing at all?
: ideally, we could do something quick and dirty that is better than nothing.

Do you add just as many entries to the vocabulary?
That can be large, and more importantly, documents don't share vocabs anymore, or vectors indexed by those vocabs.

Do you map them all to a single unknown vector?
That's small, but makes them ''by definition'' indistinguishable.
And if you wanted to do even a ''little'' extra contextual learning for them,
that learning would probably want to move that one vector in all directions, and end up doing nothing and being pointless.

Another option would be to
* reserve a number of entries for unknown words,
* assign all unknowns into there somehow, via some hash trickery (in a way that ''will'' still collide),
* hope that the words that get assigned together aren't ''quite'' as conflicting as ''all at once''.

It's a ''very'' rough approach, but
* it is definitely better than nothing.
* with a little forethought you could sort of share these vectors between documents, in that the same words will map to the same entry every time.

Limitations:
* when you smush things together and use what you previously learned, you treat unrelated things alike
* when you smush things together and ''learn'', you relate unrelated things

This sort of bloom-like intermediate is also applied to subword embeddings,
because it gives a ''sliding scale'' between
'so large that it probably won't fit in RAM' and 'so smushed together it has become too fuzzy'.

Some implementations:
* [https://github.com/explosion/floret floret] (bloom embeddings for fastText)
* thinc's HashEmbed [https://thinc.ai/docs/api-layers#hashembed]
* spaCy's MultiHashEmbed and HashEmbedCNN (which uses thinc's HashEmbed)

https://explosion.ai/blog/bloom-embeddings

https://spacy.io/usage/v3-2#vectors

-->
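A toy sketch of the hashing trick applied to counting: words map to a fixed number of buckets via a hash, so the table size is bounded up front and collisions are the accepted cost. The bucket count and the rolling hash here are arbitrary choices for the sketch (Python's built-in hash() is salted per process, so a hand-rolled stable one is used):

```python
N_BUCKETS = 16  # deliberately tiny, to make collisions likely

def bucket(word: str) -> int:
    # stable toy hash; any deterministic string hash would do
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % N_BUCKETS
    return h

counts = [0] * N_BUCKETS
for w in "the cat saw the bat".split():
    counts[bucket(w)] += 1

assert sum(counts) == 5            # every token landed in some bucket
assert counts[bucket("the")] >= 2  # 'the' occurred twice (plus any collisions)
```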


<!--
===(something inbetween)===

====semantic folding====
https://en.wikipedia.org/wiki/Semantic_folding
-->


=====Now we have nicer numbers, but how do I ''use'' them?=====
<!--


* use the vectors as-is
* adapt the embeddings with your own training
:: starts with a good basis, refines for your use
:: but: only deals with tokens already in there


* there are also some ways to selectively alter vectors
:: can be useful if you want to keep sharing the underlying embeddings


-->


=====word2vec=====
{{stub}}

word2vec is one of many ways to put semantic vectors to words (in the [[distributional hypothesis]] approach),
and refers to two techniques, using either [[bag-of-words]] or [[skip-gram]] as processing for a specific learner,
as described in T Mikolov et al. (2013), "{{search|Efficient Estimation of Word Representations in Vector Space}}",
probably the paper that kicked this dense-vector idea into broader interest.


That paper mentions
* its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow{{verify}})
* its continuous skip-gram variant predicts surrounding words given the current word.
:: Uses [[skip-gram]]s as a concept/building block. Some people refer to this technique as just 'skip-gram' without the 'continuous', but this may come from not really reading the paper you're copy-pasting the image from?
:: seems to be better at less-common words, but slower

The way it builds that happens to make it a decent classifier of that word,
so actually both work out as characterizing the word.

NN implies [[one-hot]] coding, so not small, but it turns out to be moderately efficient{{verify}}.

<!--
https://en.wikipedia.org/wiki/Word2vec

https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html

https://pathmind.com/wiki/word2vec

https://www.youtube.com/watch?v=LSS_bos_TPI&vl=en

https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
-->
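A minimal sketch of the (center, context) pairs the skip-gram variant trains on: for each word, the words within a small window around it. This is only the pair generation - the actual learner (a shallow neural net) is omitted:

```python
# Generate skip-gram training pairs from a token list.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "banana", "is", "ubiquitous"], window=1)
assert ("banana", "the") in pairs and ("banana", "is") in pairs
assert ("the", "is") not in pairs  # outside the window=1 range
```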


=====GloVe=====
<!--
Global Vectors for Word Representation

The GloVe paper compares itself with word2vec,
and concludes it consistently performs a little better.

See also:
* http://nlp.stanford.edu/projects/glove/
* J Pennington et al. (2014), "GloVe: Global Vectors for Word Representation"
-->

<!--
====Hyperspace Analogue to Language====
-->


===Moderately specific ideas and calculations===
====Collocations====
Collocations are statistically idiosyncratic sequences - the math that is often used asks
"do these adjacent words occur together more often than the occurrence of each individually would suggest?".


This doesn't ascribe any meaning,
it just tends to signal anything from
empty habitual etiquette,
jargon,
various [[substituted phrases]],
and many other things that go beyond purely [[compositional]] construction -
because what, other than common sentence structure, would make them co-occur so often?
 
...actually, there are varied takes on how useful [[collocations]] are, and why.
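The question "do these adjacent words co-occur more often than their individual frequencies would suggest?" is often scored with pointwise mutual information (PMI); a rough sketch over a made-up toy corpus:

```python
import math
from collections import Counter

tokens = "new york is a city . new york is big . a city is big".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni = len(tokens)
n_bi = len(tokens) - 1

def pmi(w1, w2):
    # log ratio of observed bigram probability to what independence predicts
    p_joint = bigrams[(w1, w2)] / n_bi
    p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
    return math.log2(p_joint / p_indep)

# 'new york' always co-occurs, so it should score higher than 'is a'
assert pmi("new", "york") > pmi("is", "a")
```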


====latent semantic analysis====


https://en.wikipedia.org/wiki/Topic_model
[[Category:Math on data]]

Latest revision as of 23:14, 21 April 2024

This is more an overview for my own use than for teaching or exercise.




=Intro=

=NLP data massage / putting meanings or numbers to words=

==Bag of words / bag of features==

The bag-of-words model (more broadly the bag-of-features model) uses the collection of words in a context, unordered, in a multiset, a.k.a. bag.

In other words, we summarize a document (or part of it) by the appearance or count of words, and ignore things like adjacency and order - and so any grammar.
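A bag-of-words is just a multiset of token counts; Python's collections.Counter is a direct fit:

```python
from collections import Counter

bow = Counter("the banana is the best banana".split())
assert bow["banana"] == 2
assert bow["the"] == 2
# order is gone: any permutation of the words gives the same bag
assert bow == Counter("banana banana best the the is".split())
```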



===In text processing===

In introductions to Naive Bayes as used for spam filtering, its naivety essentially is this assumption that feature order does not matter.


Though real-world naive Bayes spam filtering would use more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which is basically bag of words.

Other types of classifiers also make this assumption, or make it easy to do so.


Bag of features

While the idea is best known from text, hence bag-of-words, you can argue for a more general bag of features: apply it to anything you can count, which may be useful even when each feature is considered independently.

For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.


In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea is that you can describe an image by the collection of small things recognized in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.



N-gram notes

N-grams are contiguous sequences of length n.


They are most often seen in computational linguistics.


Applied to sequences of characters they can be useful e.g. in language identification, but the more common application is to words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words - for example statistical parsing, collocation analysis, text classification, sliding-window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input This is a sentence . the 2-grams would be:

This   is
is   a
a   sentence
sentence   .


...though depending on how special you do or do not want to treat the edges, people might fake some empty tokens at the edge, or some special start/end tokens.
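For illustration, a minimal n-gram extractor over already-tokenized input (plain Python; the function name is ours, and edges are left untreated):

```python
def ngrams(tokens, n):
    """All contiguous length-n slices of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngrams("This is a sentence .".split(), 2)
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
```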


Skip-grams

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note: skip-grams now seem to refer to two different things.


An extension of n-grams where components need not be consecutive (though typically stay ordered).


A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.


They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.

They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
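As a sketch of one common reading of that definition - allowing at most k tokens to be skipped between consecutive chosen components; exact definitions vary in the literature (plain Python, function name ours):

```python
from itertools import combinations

def skipgrams(tokens, n, k):
    """Length-n ordered subsequences where consecutive chosen tokens
    lie at most k+1 positions apart (i.e. up to k tokens skipped)."""
    out = []
    for idxs in combinations(range(len(tokens)), n):
        if all(b - a <= k + 1 for a, b in zip(idxs, idxs[1:])):
            out.append(tuple(tokens[i] for i in idxs))
    return out

# with k=0 this degenerates to plain n-grams
skipgrams("a b c d".split(), 2, 1)
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```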


Skip-grams apparently come from speech analysis, processing phonemes.


In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.

Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis - the continuous skip-gram variant used in word2vec (see below).



Syntactic n-grams

Flexgrams

Words as features - one-hot coding and such
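A minimal sketch of one-hot coding (plain Python; the vocabulary is illustrative): each word becomes a vector that is all zeros except for a 1 at that word's index.

```python
def one_hot(word, vocab):
    """All-zero vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["the", "banana", "is"]
one_hot("banana", vocab)   # [0, 1, 0]
```

Note that this is exactly the enumeration mentioned above, just written as a vector - it makes words numeric without making them comparable.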

Putting numbers to words

Computers and people and numbers

vector space representations, word embeddings, and more

Just count in a big table
Word embeddings
Static embeddings
Contextual word embeddings
Subword embeddings
The hashing trick (also, Bloom embeddings)
Now we have nicer numbers, but how do I use them?
vectors - unsorted

Moderately specific ideas and calculations

Collocations

Collocations are statistically idiosyncratic sequences - the math often used asks "do these adjacent words occur together more often than the occurrence of each individually would suggest?".

This doesn't ascribe any meaning; it just tends to signal anything from empty habitual etiquette, jargon, substituted phrases, and many other things that go beyond purely compositional construction - because why, other than common sentence structure, would they co-occur so often?

...actually, there are varied takes on how useful collocations are, and why.
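One common way to put a number to that question is pointwise mutual information (PMI); a minimal sketch (plain Python, function name ours, no smoothing or frequency cutoff):

```python
import math
from collections import Counter

def adjacent_pmi(tokens):
    """PMI of each adjacent pair: log2( P(w1,w2) / (P(w1) * P(w2)) ).
    Positive means the pair co-occurs more than independence predicts."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n, nb = len(tokens), len(tokens) - 1
    return {
        pair: math.log2((c / nb) / ((uni[pair[0]] / n) * (uni[pair[1]] / n)))
        for pair, c in bi.items()
    }
```

Real collocation extraction tends to prefer test statistics that are better behaved at low counts (e.g. log-likelihood ratios), since raw PMI overvalues rare pairs.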


latent semantic analysis

Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition (SVD) to term-document matrices, for text analysis and search.
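As a sketch of the mechanics (using NumPy; the toy counts are made up): take the SVD of a term-document count matrix and keep only the strongest k "latent" dimensions.

```python
import numpy as np

# toy term-document count matrix: rows = terms, columns = documents
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # number of latent dimensions kept
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # best rank-k approximation of A

# documents can then be compared in the k-dimensional space Vt[:k, :].T
```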


random indexing

https://en.wikipedia.org/wiki/Random_indexing


Topic modeling

Roughly, the idea is: given documents that are each about a particular topic, one would expect particular words to appear in each more or less frequently.

Assuming such documents sharing topics, you can probably find groups of words that belong to those topics.

Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document - so this acts like a soft/fuzzy clustering.

This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires zero training, it works better than you might expect when those assumptions are met (the largest probably being that your documents each have a single topic).


https://en.wikipedia.org/wiki/Topic_model


word2vec

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

word2vec is one of many ways to put semantic vectors to words (in the distributional-hypothesis approach). It refers to two techniques, using either bag-of-words or skip-grams as processing for a specific learner, as described in T Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space" - probably the paper that kicked the dense-vector idea into popular interest.


Word2vec amounts to building a classifier that predicts which word appears in a context, and/or which context appears around a word - which happens to do a decent job of characterizing that word.


That paper mentions

  • its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow(verify))
  • its continuous skip-gram variant predicts surrounding words given the current word.
: It uses skip-grams as a concept/building block. Some people refer to this technique as just 'skip-gram', without the 'continuous' - but that may come from not really reading the paper you're copy-pasting the image from.
: It seems to be better at less-common words, but slower.


(a neural net implies one-hot coding of the input, so the input is not small, but training turns out to be moderately efficient(verify))
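To make "predicts surrounding words given the current word" concrete, a sketch of how skip-gram training pairs are generated (plain Python; function name and window size are illustrative - the actual learner then trains on these (center, context) pairs):

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: each word is paired with every word
    within `window` positions of it, in both directions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

skipgram_pairs("a b c".split(), window=1)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```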


GloVe