Data modeling, restructuring, and massaging

=NLP data massage / putting meanings or numbers to words=






==Words as features - one-hot coding and such==
<!--
'''One-hot coding'''[https://en.wikipedia.org/wiki/One-hot]
is any method that wants a unique element per feature.
This has a different meaning/implication around electronics and statistics (in particular when coding ''numbers'', where it avoids doing so with binary combinations - but this can be distracting from the wider idea).

In the context of words and NLP, that often means each word gets its own feature (which itself is still a bag-of-words approach).
This often means a large, fixed-length vector that corresponds with a fixed vocabulary.
The above methods ''can'' qualify, if you are prepared to have each word or n-gram be a unique feature.

Depending a little on how you transport those numbers from place to place, this may be called '''one-hot encoding''':
each unique input gets its own element in a vector / bit in a bitstream (as opposed to some entangled/combined coding, like binary coding).

Useful to input data in a non-overlapping way - even if representations may later be ''learned'' to overlap.
This also makes it easy to e.g. consider a document as a count over lots of one-hot vectors.

This works at all, but runs into dimensionality and sparsity problems.
From another view, all the words in a language do not have orthogonal meaning.

'''Word vector''' approaches may do something like one-hot in preparation, but do some sort of dimensionality reduction on a space of words (via linear algebra or such), to the end of getting a reduced word space in which
: co-occurrence is learned, and implicitly used to remove a lot of redundancy
: you get a gliding scale of how much noise to model away
: e.g. euclidean distances resemble semantic distance.

Applied at word level, this just doesn't deal with unknown words.
Most things don't, but in this case it's not easy to add.
-->
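As a minimal sketch of the above (made-up two-document corpus; real vocabularies are far larger and the vectors correspondingly sparser):

<syntaxhighlight lang="python">
# Minimal sketch of one-hot / count vectors over a fixed vocabulary (made-up corpus).
import numpy as np

docs = [["the", "banana", "is", "yellow"],
        ["the", "road", "is", "long"]]

vocab = sorted({w for doc in docs for w in doc})     # fixed vocabulary
index = {w: i for i, w in enumerate(vocab)}          # word -> element position

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# a document as a count over lots of one-hot vectors, i.e. bag of words
doc_vec = sum(one_hot(w) for w in docs[0])
print(vocab)
print(doc_vec)
</syntaxhighlight>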




==Putting meanings or numbers to words==
<!--
The specific analyses include
* Matrix factorization, e.g. Singular Value Decomposition
:: the term Latent Semantic Analysis is often seen alongside this, or as a name for this


===Knowledge base style===
* [[n-gram]], often [[skip-gram]]

* [http://nlp.stanford.edu/projects/glove/ GloVe]
: chooses a global log-bilinear regression (based in part on the observation that LSA-style counting uses global statistics but does relatively poorly at analogy-like tasks)


===Statistic style===

Terms like Statistical Semantics seem to have been introduced by Warren Weaver, who argued that automated [[word sense disambiguation]] could be based on the co-occurrence frequency of the context words near a given target word.

Statistical semantic methods have been used for word meaning similarity, word relations, keyword extraction, text cohesiveness testing, word sense disambiguation, and more.


See also:
* http://www.aclweb.org/aclwiki/index.php?title=Distributional_Hypothesis
* http://www.aclweb.org/aclwiki/index.php?title=Statistical_Semantics
* http://www.aclweb.org/aclwiki/index.php?title=Annotated_Bibliography_on_Statistical_Semantics
* http://www.aclweb.org/aclwiki/index.php?title=Computational_Semantics

* Z Harris (1954) ''Distributional structure''
* W Weaver (1955) ''Translation''
* JR Firth (1957) ''A synopsis of linguistic theory, 1930-1955''
* S McDonald, M Ramscar (2001) ''Testing the distributional hypothesis: The influence of context on judgements of semantic similarity''
* and many others
Turney
-->


====Word embeddings====
<!--
Distributional clustering - clustering based on the idea above.
Often done for things like lexical estimation/acquisition, and doing so from large but real-world (relatively relation-sparse) data sets.
-->
=Putting numbers to words=
 
 
====Computers and people and numbers====
<!--
Where people are good at words and bad at numbers, computers are good at numbers and bad at words.


So it makes sense to express words ''as'' numbers?


That does go a moderate way, but it really matters ''how''.
If you want to approximate the contained meaning ''somehow'', a bare enumeration is demonstrably still too clunky.


If you just make a list enumerating, say, 1=the, 2=banana, 3=is, ..., 150343=ubiquitous
: it makes data ''smaller'' to work with, but equally clunky to do anything with other than counting {{comment|(...which has its uses - e.g. gensim's collocation analysis does such an enumeration to keep its intermediate calculations all-numeric)}}
: but it does absolutely nothing to make words ''comparable''
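As an illustration of such an enumeration - gensim's Dictionary is one real implementation of exactly this; the toy corpus here is made up:

<syntaxhighlight lang="python">
# Sketch: gensim's Dictionary enumerates words as integer ids,
# which keeps things compact and makes counting easy - but says nothing about similarity.
from gensim.corpora import Dictionary

texts = [["the", "banana", "is", "ubiquitous"],
         ["the", "road", "is", "long"]]

d = Dictionary(texts)
print(d.token2id)              # e.g. {'banana': 0, 'is': 1, 'the': 2, ...}
print(d.doc2bow(texts[0]))     # [(id, count), ...] - purely counting, no meaning attached
</syntaxhighlight>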
 
 
In a lot of cases it would be really nice if we could encode ''some'' amount of meaning.

But since that asks multiple difficult questions at once,
like 'how would you want to encode different kinds of things in a way that is useful later',
we are initially happy with the ability to say that 'is' is more like 'be' than like 'banana' -
at least a metric of similarity, or of what kind of role a word plays.


There are many approaches to this you can think of,
and almost everything you can think of has been tried, to moderate success.
 
 
 
It can help to add a knowledge base of sorts
: Maybe start with an expert-of-sorts making a list of all terms you know, and recognizing them in a text.
: Maybe go as far as modeling that swimming is a type of locomotion, and so is walking.

It can help to collect statistics about what kind of sentence patterns there are,
and what kinds of words are usually where.
It can help to detect inflections - say, if a word ends in -ness then it's probably a [[noun]] and maybe [[uncountable]],
and you can tag that even if you don't know the root word/morpheme it's attached to.

Each of these may take a large amount of work, and each can do a single task with good accuracy,
though it may only be good at what it does because it doesn't even try other things.
 
It can help to combine methods, and some of the systems built like this perform quite decently.
 
 
 
It turns out that how precise a system is at specific things
can matter as much as how consistent it is throughout.

Maybe it can extract very specific sentences with perfect accuracy,
but ignores anything it isn't sure about.
 
 
 
Fuzzier methods are also common, in part because they can be trained to give ''some'' answer for everything,
and even if they're not high quality they can be consistent.
 
Say, you can get a reasonable answer to basic (dis)similarity between all words without much up-front annotation or knowledge,
because for basically all words you can find a little context to feed into the system.
 
 
 
-->
 
<!--
This can include
* words that appear in similar context (see also [[word embeddings]], even [[topic modelling]])
 
* words that have similar meaning
 
* Topological / ontological similarity - based on more strongly asserted properties
: Say, where many methods may put 'car', 'road', and 'driving' in the same area, this may also say roughly ''how'' they are related, in a semantic sense. This can be more precise distance-wise, except that it is also more easily incomplete.
 
-->
<!--
 
Not quite central to the area, but helpful to some of the contrasts we want to make,
is '''semantic similarity''', the broad area of [[metric]]s between words, phrases, and/or documents.
 
Other people use the term semantic similarity to focus on the ontological sort of methods - and possibly specifically "is a" and other strongly [[ontology|ontological]] relationships, and ''not'' just "seems to occur in the same place". {{comment|('''Semantic relatedness''' is sometimes added to mean 'is a' plus [[antonyms]] (opposites), [[meronyms]] (part of whole), [[hyponyms]]/[[hypernyms]])}}
 
Some people use semantic similarity to refer to absolutely ''any'' metric that gives you distances.
 
https://en.wikipedia.org/wiki/Semantic_similarity
-->
 
====vector space representations, word embeddings, and more====
<!--
 
Around text, '''vector space representations''' are the general idea
that for each word (or similar unit) you can somehow figure out a dense, semantically useful vector.

: ''Dense'' compared to the input: when you see text as 'one of ~100K possible words', then a vector of maybe two hundred dimensions that does a good job packing in enough meaning for good (dis)similarity comparisons is pretty good

: ''Semantically useful'' in that those properties tend to be useful
:: often focusing on such (dis)similarity comparisons - e.g. 'bat' may get a sense of animal, tool, verb, and noun -- or rather, in this space it appears closer to animals and tools and verbs than e.g. 'banana' does.
 
 
There is a whole trend of, instead of using well-annotated data (which is costly to come by),
using orders of magnitude more ''non-annotated'' data (much easier to come by),
based on assumptions like the one the [[distributional hypothesis]] makes:
that words appearing in similar contexts, and training from nearby words, is enough to give a good sense of comparability.

Years ago that was done with things like linear algebra (see e.g. [[Latent Semantic Analysis]]),
now it is done with more complex math and/or neural nets,
which is a more polished way of training faster - though the way you handle the output is much the same.


Yes, this training is a bunch of up-front work,
but assuming you can learn it well for a target domain (and trying to learn from ''all'' data often gives good ''basic'' coverage of most domains),
then many things built on top have a good basis (and do not have to deal with classical training issues like high dimensionality, sparsity, smoothing, etc).
 
 
 
'''Word embeddings''' sometimes has a much more specific meaning - a specific way of finding and using text vectors
(there are varied definitions, and you can argue that under some of them, terms like 'static word embeddings' make no sense) -
but the terms are mixed so much that you should probably read 'embeddings' as 'text vectors'
and figure out yourself what the implementation actually is.
 
---
 
 
 
In the context of some of the later developments, the simpler implementations are considered static vectors.

'''Static vectors''' refer to systems where each word always gets the same vector.

That usually means:
* make a vocabulary: each word gets an entry
* learn a vector for each item in the vocabulary
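A minimal sketch of that lookup, with random stand-in vectors rather than learned ones:

<syntaxhighlight lang="python">
# Sketch of a static word-vector lookup: one fixed vector per vocabulary entry,
# regardless of context. The vectors here are random stand-ins for learned ones.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["we", "saw", "the", "bunny"]
vectors = {w: rng.normal(size=50) for w in vocab}    # normally the learned embeddings

def embed(tokens):
    return [vectors[t] for t in tokens]              # same word -> same vector, always

sent = embed(["we", "saw", "the", "saw"])
print(np.array_equal(sent[1], sent[3]))              # True: both 'saw's get the same vector
</syntaxhighlight>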
 
 
 
---
 
 
 
 
We previously mentioned that putting a single number on a word has issues.

We now point out that putting a single vector on a word also has some issues.


Consider, say, "we saw the saw".
With a static vector method, those two words ''have'' to be represented as the same thing.
The 'saw' in "we saw a bunny" or "I sharpened the saw" will get one sense - probably a mix of the tool quality and the seeing quality -
and any use will suggest it's slightly about tools
{{comment|(by count, most 'saw's are the seeing, something that even unsupervised learning should tell you)}}.

You can imagine it's not really an issue for words like antidisestablishmentarianism.
There's not a lot of varied subtle variation in its use, so you can pretend it has one meaning, one function.

But saw, hm.


And if you table a motion, it ''will'' associate with gestures and woodworking,
because those are the more common senses.



Can't you just have multiple saws, encode 'saw as a verb' differently from 'saw as an implement'?
Sure, now you can ''store'' that, but how do you decide which one you are looking at?

There are some further issues from inflections, but ignoring those for now,
a larger problem is that the whole idea is somewhat circular,
depending on already knowing the correct parse to learn this from.
{{comment|(also on knowing which words need this separated treatment, though arguably you can figure that out from which things end up being approximated in sufficiently distinct ways or not)}}

In reality, finding the best parse of sentence structure just isn't independent from finding its meaning.
You end up having to do both concurrently.


A lot of language is pretty entangled, and needs to be resolved by the context of what words do to nearby words - human brevity relies on some ambiguity resolving.
And it turns out the most commonly used words are often the weirder ones.


This entanglement seems to help things stay compact, with minimal ambiguity, and without requiring very strict rules.
Natural languages seem to end up balancing the amount of rules and exceptions to reasonable levels {{comment|(even [[conlang]]s like Lojban think about this, though they engineer it explicitly)}},
and it has some other uses - like [[double meanings]], and intentionally generalizing meanings.


So we're stuck with compact complexity.
We can try to model absolutely everything we do in each language, and that might even give us something more useful in the end.




'''Yet''' imagine for a moment a system that would just pick up 'saw after a pronoun' and 'saw after a determiner',
and not even because it knows what pronouns or determiners are, but because given a ''ton'' of examples,
those are two of the things the word 'saw' happens to often be next to.


Such a system ''also'' doesn't have to know that it is modeling a certain verbiness and nouniness as a result.
It might, from different types of context, perhaps learn that one of these contexts relates it to a certain tooliness as well.
But, not doing that on purpose, such a system won't and ''can't'' explain such aspects.


So why mention such a system? Why do that at all?




Usually because it learns these things without us telling it anything, and the similarity it ends up with helps things like:
: "if I search for walk, I also get things that mention running",
: "this sentence probably says something about people moving",
: "walk is like run, and to a lesser degree swim",
: "I am making an ontology-style system, and would like some assistance to not forget adding related things".

The "without us telling it anything" - it being an [[unsupervised]] technique - also matters.

You will probably get more precise answers with the same amount of well-annotated data.
You will probably get equally good answers with less annotated data.
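A sketch of such similarity queries, here over pretrained vectors fetched through gensim's downloader (assumes network access, and that this particular packaged model is available):

<syntaxhighlight lang="python">
# Sketch: similarity queries over pretrained static vectors via gensim's downloader.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")      # a small packaged model; returns KeyedVectors

print(wv.most_similar("walk", topn=5))       # expect run/walking/etc. near the top
print(wv.similarity("walk", "run"))          # noticeably higher...
print(wv.similarity("walk", "banana"))       # ...than this
</syntaxhighlight>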


But the thing is that annotated data is hard and expensive, because it's a lot of work.
You can have endless discussions about annotation ''because'' there is ambiguity in there, so there's probably an upper limit - or even more time spent.

'''It's just easier to have ''a lot more'' un-annotated data''',
meaning you don't have to annotate half the world's text with precise meaning for it to work,
so even if it needs ''so much more'' text, a method that then does comparably well is certainly useful.

It may pick up relations only softly, and in ways you can't easily fetch out,
but it's pretty good at not missing them.

You just feed it lots of text.




There are limitations - some upsides that are arguably also downsides.

It might pick up on more subtleties, but like any unsupervised technique,
it tends to be better at finding things than at describing ''what'' it found.

You would get further by encoding 'saw as a verb' and 'saw as an implement' as different things, sure,
but that would only solve how to ''store'' knowing that. Not finding it out.


This is not a great introduction, because there are multiple underlying issues here,
and we aren't even going to solve them all; we will just choose to go half a step fuzzier,
to a point where we can maybe encode multiple uses of words (and if they have a single one, great!).




All of this may still apply a single vector to the same word always (sometimes called static word embeddings).
This is great for unambiguous content words, but less so for polysemy and homonymy.
---




-->
<!--


The first attempts at word embeddings, and many since,
were static vectors,
meaning that the lookup (even if trained from something complex)
always gives the same vector for the same word.

This keeps the amount of data manageable, and keeps it a simple and fast lookup.




and more details where ambiguity is low,
but will do poorly in the specific cases where a word's meaning depends on context.




We like the idea of one word, one meaning, but most languages ''really'' messed that one up.
''' ''Contextual'' word embeddings''', on the other hand,
learn about words ''in a sequence''.
This is still just statistics,
but the model you run will give a different vector for the same word depending on the words around it.
This is a bunch more data, a bunch more training,
and not always worth it.
Depending on how much context you pay attention to,
this is even a modestly decent approach to machine translation.
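A sketch of what 'a different vector per occurrence' looks like in practice, using HuggingFace transformers; the model choice here is just a common example:

<syntaxhighlight lang="python">
# Sketch: contextual vectors - the same word gets different vectors depending on context.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tok("we saw the saw", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]          # (tokens, hidden_size)

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
saw_positions = [i for i, t in enumerate(tokens) if t == "saw"]
v1, v2 = hidden[saw_positions[0]], hidden[saw_positions[1]]

cos = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(cos.item())   # high, but not 1.0 - the two 'saw's are no longer forced to be identical
</syntaxhighlight>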


-->
=====Subword embeddings=====
<!--
A model where a word/token can be characterized by something ''smaller'' than that exact whole word -
a technique that assigns meanings to words
via meanings learned on subwords, which can be arbitrary fragments.

Often this works just because words happen to share large fragments of characters,
not because of stronger analysis like good lemmatization or modeling strongly compositional agglutination (e.g. Turkish).
Chances are it will pick up on such strong patterns, but that's more a side effect of them indeed being regular.


This can also do quite well at things otherwise [[out of vocabulary]].
Say, the probably-out-of-vocabulary ''apploid'' may get a decent guess
if we learned a vector for ''appl'' from e.g. apple.

Also, it starts dealing with misspellings a lot better.


Understanding the language's morphology would probably do a little better,
but just sharing larger fragments of characters tends to do well enough,
in part because inflection, compositional agglutination (e.g. Turkish)
and such are often ''largely'' regular.




Yes, this is sort of an [[n-gram]] trick,
and for that reason the data (which you ''do'' have to load to use)
can quickly explode.
For this reason it's often combined with [[bloom embeddings]].


Examples:
fastText, floret

[https://d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html]
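A sketch of the subword idea using gensim's FastText implementation on a toy corpus - the vectors themselves will be junk, the point is only that an out-of-vocabulary word still gets a vector built from shared character fragments:

<syntaxhighlight lang="python">
# Sketch: subword (character n-gram) embeddings handle out-of-vocabulary words.
from gensim.models import FastText

sentences = [["we", "like", "apple", "and", "apples"],
             ["the", "apple", "was", "sharpened"]] * 50   # toy data, repeated

model = FastText(sentences, vector_size=32, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=5)

print("apploid" in model.wv.key_to_index)      # False: never seen as a whole word
print(model.wv["apploid"][:5])                 # ...but it still gets a vector from its fragments
print(model.wv.similarity("apploid", "apple"))
</syntaxhighlight>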
-->


=====Bloom embeddings, a.k.a. the hash trick=====
<!--
An idea akin to [[bloom filter]]s applied to word embeddings - getting better-than-nothing embeddings from something very compact.


Consider a language model that should be assigning vectors.

When it sees [[out-of-vocabulary]] words, what do you do?

Do you treat them as not existing at all?
: ideally, we could do something quick and dirty that is better than nothing.

Do you add just as many entries to the vocabulary?
That can be large, and more importantly, now documents don't share vocabularies anymore, or vectors indexed by those vocabularies.

Do you map them all to a single unknown-word vector?
That's small, but makes them ''by definition'' indistinguishable.

If you wanted to do even a ''little'' extra contextual learning for them,
that learning would probably want to move that one vector in all directions, and end up doing nothing and being pointless.


Another option would be to
* reserve a number of entries for unknown words,
* assign all unknowns into those somehow (in a way that ''will'' still collide)
:: via some hash trickery
* hope that the words that get assigned together aren't ''quite'' as conflicting as ''all at once''.


It's a ''very'' rough approach, but
* it is definitely better than nothing.
* with a little forethought you could sort of share these vectors between documents
:: in that the same words will map to the same entry every time


Limitations:
* when you smush things together and use what you previously learned, colliding words get each other's meaning
* when you smush things together and ''learn'', you relate unrelated things


This sort of bloom-like intermediate is also applied to subword embeddings,
because it gives a ''sliding scale'' between
'so large that it probably won't fit in RAM' and 'so smushed together it has become too fuzzy'.


Examples include
* [https://github.com/explosion/floret floret] (bloom embeddings for fastText)
* thinc's HashEmbed [https://thinc.ai/docs/api-layers#hashembed]
* spaCy's MultiHashEmbed and HashEmbedCNN (which uses thinc's HashEmbed)


https://explosion.ai/blog/bloom-embeddings

https://spacy.io/usage/v3-2#vectors
-->
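A minimal sketch of the hashing idea itself (sizes and the number of hashes are arbitrary, and the table is random rather than learned):

<syntaxhighlight lang="python">
# Sketch of the hashing trick for embeddings: no vocabulary, just a fixed-size table.
# Each word hashes (with a few different seeds) to rows that are summed -
# collisions happen, but rarely with all hashes at once.
import numpy as np
import zlib

ROWS, DIM, HASHES = 4096, 64, 3
table = np.random.default_rng(0).normal(size=(ROWS, DIM))   # normally learned

def embed(word):
    rows = [zlib.crc32((str(seed) + word).encode()) % ROWS for seed in range(HASHES)]
    return table[rows].sum(axis=0)

print(embed("walk")[:4])
print(embed("qzxlorbl")[:4])   # unseen words still get a (rough) vector
</syntaxhighlight>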


===The distributional hypothesis===

The distributional hypothesis is the idea that
words that are used and occur in the same contexts tend to convey similar meanings - "a word is characterized by the company it keeps".

This idea is known under a few names,
but note that few of them really describe a technique,
or even the specific assumptions they make.

https://en.wikipedia.org/wiki/Distributional_semantics
<!--
Distributional Similarity

Distributional semantics

'''Distributional similarity''' can refer to analysis to bring those out, often with the goal of figuring out the relevant semantics.

For example, applied to noun-verb combinations, this means being able to e.g. predict noun similarity based on their likeliness of combination with the same verbs.

===(something inbetween)===

====semantic folding====
https://en.wikipedia.org/wiki/Semantic_folding

====Hyperspace Analogue to Language====
-->


===Moderately specific ideas and calculations===

====Collocations====
Collocations are statistically idiosyncratic sequences - the math that is often used asks
"do these adjacent words occur together more often than the occurrence of each individually would suggest?".


This doesn't ascribe any meaning,
it just tends to signal anything from
empty habitual etiquette,
jargon,
various [[substituted phrases]],
and many other things that go beyond purely [[compositional]] construction,
because why other than common sentence structures would they co-occur so often?


...actually, there are varied takes on how useful [[collocations]] are, and why.
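A sketch of that co-occurrence test using NLTK's collocation finder; corpus.txt is a placeholder for whatever text you have lying around:

<syntaxhighlight lang="python">
# Sketch: finding collocations by comparing observed co-occurrence against chance,
# here via NLTK's bigram association measures. 'corpus.txt' is a placeholder file.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = open("corpus.txt", encoding="utf-8").read().lower().split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)                            # ignore very rare pairs
print(finder.nbest(measures.likelihood_ratio, 20))     # the most idiosyncratic adjacent pairs
</syntaxhighlight>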




====word2vec====
{{stub}}

word2vec is one of many ways to put semantic vectors to words (in the [[distributional hypothesis]] approach),
and refers to two techniques, using either [[bag-of-words]] or [[skip-gram]] as processing for a specific learner,
as described in T Mikolov et al. (2013), "{{search|Efficient Estimation of Word Representations in Vector Space}}",
probably the paper that kicked the dense-vector idea into wider interest.


Word2vec could be seen as building a classifier that predicts what word appears in a context, and/or what context appears around a word,
which happens to do a decent job of characterizing that word.


That paper mentions
* its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow{{verify}})

* its continuous skip-gram variant predicts surrounding words given the current word.
:: Uses [[skip-gram]]s as a concept/building block. Some people refer to this technique as just 'skip-gram' without the 'continuous',
but this may come from not really reading the paper you're copy-pasting the image from?
:: seems to be better at less-common words, but slower

(NN implies [[one-hot]] coding, so not small, but it turns out to be moderately efficient{{verify}})


But, significantly, you need no annotation.
<!--
https://en.wikipedia.org/wiki/Word2vec

https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html

https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
-->
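A sketch of training it with gensim's implementation; the toy sentences stand in for a real tokenized corpus, and sg switches between the cbow (0) and continuous skip-gram (1) variants:

<syntaxhighlight lang="python">
# Sketch: training word2vec with gensim. 'sentences' stands in for a real tokenized corpus.
from gensim.models import Word2Vec

sentences = [["we", "walk", "to", "the", "park"],
             ["we", "run", "to", "the", "park"],
             ["bananas", "are", "yellow"]] * 100

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, epochs=10)                     # sg=0: cbow, sg=1: continuous skip-gram

print(model.wv.most_similar("walk", topn=3))          # on real data: run, stroll, ...
</syntaxhighlight>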


 
====GloVe====
<!--
Global Vectors for Word Representation.

The GloVe paper itself compares itself with word2vec,
and concludes it consistently performs a little better.


See also:
* http://nlp.stanford.edu/projects/glove/
* J Pennington et al. (2014), "GloVe: Global Vectors for Word Representation"
-->




[[Category:Math on data]]


This is more for overview of my own than for teaching or exercise.



Intro

NLP data massage / putting meanings or numbers to words

Bag of words / bag of features

The bag-of-words model (more broadly bag-of-features model) uses the collection of words in a context, unordered, in a multiset, a.k.a. bag.

In other words, we summarize a document (or part of it) by the appearance or count of words, and ignore things like adjacency and order - and with that, any grammar.



In text processing

In introductions to Naive Bayes as used for spam filtering, its naivety is essentially this assumption that feature order does not matter.


Though real-world naive bayes spam filtering would take more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words.

Other types of classifiers also make this assumption, or make it easy to do so.
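A sketch of the counting itself, with scikit-learn's CountVectorizer and two made-up documents:

<syntaxhighlight lang="python">
# Sketch: bag-of-words counts with scikit-learn - order and grammar are thrown away.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["we saw the saw", "the dog saw the person on the grass"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # sparse matrix: documents x vocabulary

print(vec.get_feature_names_out())
print(X.toarray())                   # per-document word counts, nothing more
</syntaxhighlight>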


Bag of features

While the idea is best known from text, hence bag-of-words, you can argue for a bag of features, applying it to anything you can count, which may be useful even when the features are considered independently.

For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.


In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often easily secondary.


See also:

N-gram notes

N-grams are contiguous sequences of length n.


They are most often seen in computational linguistics.


Applied to sequences of characters it can be useful e.g. in language identification, but the more common application is to words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding window methods (e.g. a sliding window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input This is a sentence . the 2-grams would be:

This   is
is   a
a   sentence
sentence   .


...though depending on how special you do or do not want to treat the edges, people might fake some empty tokens at the edge, or some special start/end tokens.
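The same 2-gram example as a few lines of code:

<syntaxhighlight lang="python">
# n-grams as a sliding window over tokens (no special treatment of the edges).
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["This", "is", "a", "sentence", "."], 2))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
</syntaxhighlight>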


Skip-grams

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note: Skip-grams now seem to refer to two different things.


An extension of n-grams where components need not be consecutive (though typically stay ordered).


A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.
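A sketch using NLTK's skipgrams helper (here 2-skip-2-grams, i.e. pairs whose members may be up to 2 tokens apart but stay in order):

<syntaxhighlight lang="python">
# Sketch: k-skip-n-grams - here 2-skip-2-grams over a short token sequence.
from nltk.util import skipgrams

tokens = ["we", "saw", "the", "old", "saw"]
print(list(skipgrams(tokens, 2, 2)))
</syntaxhighlight>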


They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.

They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).


Skip-grams apparently come from speech analysis, processing phonemes.


In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.

Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis.



Syntactic n-grams

Flexgrams



latent semantic analysis

Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.
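A sketch of that, using scikit-learn's TruncatedSVD on a tf-idf matrix built from a placeholder corpus (n_components would normally be a few hundred):

<syntaxhighlight lang="python">
# Sketch of LSA: SVD on a (tf-idf weighted) term-document matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats",
        "stock markets fell",
        "shares and stocks"]

X = TfidfVectorizer().fit_transform(docs)      # documents x terms
lsa = TruncatedSVD(n_components=2).fit(X)
print(lsa.transform(X))                        # documents in the reduced 'semantic' space
</syntaxhighlight>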


random indexing

https://en.wikipedia.org/wiki/Random_indexing


Topic modeling

Roughly the idea that, given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.

Assuming such documents share topics, you can probably find groups of words that belong to those topics.

Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document, so it acts like a soft/fuzzy clustering.

This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires zero training, it works better than you might expect when those assumptions are met (the largest probably being your documents having singular topics).


https://en.wikipedia.org/wiki/Topic_model
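A sketch with gensim's LDA implementation on a toy corpus - real use wants far more documents for the assumptions to do any work:

<syntaxhighlight lang="python">
# Sketch: topic modeling with gensim's LDA on a toy corpus.
from gensim import corpora, models

texts = [["cat", "dog", "pet", "fur"],
         ["dog", "walk", "leash", "pet"],
         ["stock", "market", "shares", "price"],
         ["market", "price", "trading", "stock"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
print(lda.get_document_topics(corpus[0]))      # soft assignment of topics to a document
</syntaxhighlight>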

