Data modeling, restructuring, and massaging



...actually, there are varied takes on how useful [[collocations]] are, and why.
====Semantic similarity====
<!--
Semantic similarity is the broad area of [[metric]]s between words, phrases, and/or documents.
In some cases, semantic similarity can refer to any metric that gives you distances between those,
based on any sort of estimation of the likeness.
Some people use the term '''semantic similarity''' to focus on the ontological sort of methods - possibly specifically "is a" and other [[ontology|ontological]] relationships, and ''not'' just "seems to co-occur". {{comment|(and in that context, a distinction is sometimes made with '''semantic relatedness''', which widens from is-a to also include [[antonyms]] (opposites), [[meronyms]] (part of whole), and [[hyponyms]]/[[hypernyms]])}}
However, the fuzzier methods are more common, in part because you can get a reasonable answer to basic (dis)similarity between all words
without a ton of up-front work, and for more words - any word that you can manage to give some context.
This can include
* words that appear in similar context (see also [[word embeddings]], even [[topic modelling]])
* words that have similar meaning
* Topological / ontological similarity - based on more strongly asserted properties
: Say, where many methods may put 'car', 'road', and 'driving' in the same area, this may also say roughly ''how'' they are related, in a semantic sense. This can be more precise distance-wise, except that it is also more easily incomplete.
https://en.wikipedia.org/wiki/Semantic_similarity
-->
====vector space representations, word embeddings====
<!--
Around text, '''vector space representations'''
are the general idea that you can map fragments (usually words) to fairly dense,
semantically useful feature vectors, using reasonable amounts of processing.

Dense in the sense that you need a ''lot'' fewer dimensions (usually a few hundred at most)
than there are input words (often at least tens of thousands),
while they still seem to distinguish a lot of the differences that the training text suggests.

Recently, these are often trained by quantity - a ''lot'' of text - rather than a smaller set of well-annotated data.
The [[distributional hypothesis]] suggests that training from just adjacent words can capture a useful amount of meaning.

Years ago this was done with linear algebra (see e.g. [[Latent Semantic Analysis]]),
now it is done with more complex math, and/or neural nets.

It's an extra job you need to do up front, but assuming you can learn it well,
the real learner now doesn't have to deal with ridiculous dimensionality,
or the sparsity/smoothing issues that that often brings with it.
-->


====Word embeddings====




Methods that have a single number or a single vector for a word have some issues.


Consider, say, 'saw'.
If 'saw' is represented by one specific number,
then "we saw the saw" cannot be expressed in numbers without those two words ''having'' to be the same thing.
"we saw a bunny" or "I sharpened the saw" will have one sense - probably both the tool quality and the seeing quality,
and any use will tell you it's slightly about tools
{{comment|(by count, most 'saw's are the seeing, something that even unsupervised learning should tell you)}}.


You can imagine that's not a problem for, say, antidisestablishmentarianism.
There's not a lot of varied subtle variation in its use, so you can pretend it has one meaning, one function.
But saw, hm.


Can't you just have multiple saws, encode 'saw as a verb' differently from 'saw as an implement'?
Sure, now you can ''store'' that, but how do you decide which one a given use is?
There are some further issues from inflections, but ignoring those for now,
a larger problem is that the whole idea is somewhat circular,
depending on already knowing the correct parse to learn this from
{{comment|(also on knowing which words need this separated treatment, though arguably ''that'' you can figure out from whether things end up being approximated in sufficiently distinct ways or not)}}.


In reality, having a good parse of sentence structure is just ''not'' independent from finding its meaning;
you end up having to do both concurrently.
It turns out that a lot of language ''does'' need to be resolved by the context of what words do to nearby words -
human brevity relies on some ambiguity resolving -
and it turns out the most commonly used words are often the weirder ones.


This entanglement seems to help things stay compact, with minimal ambiguity, and without requiring very strict rules,
which are things that natural languages seem to like to balance {{comment|(even [[conlang]]s like lojban think about this balance, though they engineer it explicitly)}},
and it has some other uses - like [[double meanings]], and intentionally generalizing meanings.


So we're stuck with compact complexity.
We can try to model absolutely everything we do in each language, and that might even give us something more useful in the end.


'''Yet''' imagine for a moment a system that would just pick up 'saw after a pronoun' and 'saw after a determiner',
and not even because it knows what pronouns or determiners are, but because given a ''ton'' of examples,
those are two of the things the word 'saw' happens to often be next to.


Such a system ''also'' doesn't have to know that it is modeling a certain verbiness and nouniness as a result.
It might, from different types of context, perhaps learn that one of these contexts relates it to a certain tooliness as well.
But, not doing that on purpose, such a system won't and ''can't'' explain such aspects.


So why mention such a system? Why do that at all?


Usually because it learns these things without us telling it anything, and the similarity it ends up with helps with things like:
: "if I search for walk, I also get things that mention running",
: "this sentence probably says something about people moving",
: "walk is like run, and to a lesser degree swim",
: "I am making an ontology-style system, and would like some assistance to not forget adding related things".


The "without us telling it anything" -- it being an [[unsupervised]] technique -- also matters.
You will probably get more precise answers from a supervised technique with the same amount of well-annotated data,
or equally good answers with less annotated data.
But the thing is that annotated data is hard and expensive, because it's a lot of work -
and you can have endless discussions about the annotation ''because'' there is ambiguity in there,
so there's probably an upper limit, or even more time spent.


'''It's just much easier to have ''a lot more'' un-annotated data''',
so even if it needs ''so much more'' text, a method that then does comparably well is certainly useful.
You just feed it lots of text.


There are limitations, and some upsides that are arguably also downsides.
It might pick up on more subtleties, but like any unsupervised technique,
it tends to be better at finding things than at describing ''what'' it found.
It may pick up relations only softly, and in ways you can't easily extract or learn from,
but it's pretty good at not completely missing them,
meaning you don't have to annotate half the world's text with precise meaning for it to work.


You would get further by encoding 'saw as a verb' and 'saw as an implement' as different things, sure,
but that only solves how to ''store'' knowing that, not finding it out.


This is not a solution to all of the underlying issues here. We're not even trying to solve them all;
in fact we're just going to gloss over a lot of them,
to a point where we can maybe encode multiple uses of words (and if they have a single one, great!).
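To make that a little more concrete, a quick sketch of the sort of query this enables - this assumes the gensim library, and downloads a smallish set of pretrained GloVe vectors (roughly 65 MB) the first time:

<syntaxhighlight lang="python">
# Sketch: querying pretrained word vectors for neighbours and similarity.
# Assumes the gensim library; api.load() fetches the vectors the first time.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")    # a KeyedVectors object

# Words used in similar contexts end up with nearby vectors,
# so 'walk' should land much closer to 'run' than to, say, 'banana'.
print(vectors.most_similar("walk", topn=5))
print(vectors.similarity("walk", "run"))
print(vectors.similarity("walk", "banana"))

# 'saw' still gets a single vector, blending the verb and the tool senses.
print(vectors.most_similar("saw", topn=10))
</syntaxhighlight>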






===Could be either style===
====Some specific calculations====
=====word2vec=====
{{stub}}

word2vec is one of many ways to put semantic vectors to words (in the [[distributional hypothesis]] approach),
and refers to two techniques, using either [[bag-of-words]] or [[skip-gram]] as processing for a specific learner,
as described in T Mikolov et al. (2013), "{{search|Efficient Estimation of Word Representations in Vector Space}}"

That paper mentions
* its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow{{verify}})
* its continuous skip-gram variant predicts surrounding words given the current word.
:: Uses [[skip-gram]]s as a concept/building block. Some people refer to this technique as just 'skip-gram' without the 'continuous', but this may come from not really reading the paper you're copy-pasting the image from?
:: seems to be better at less-common words, but slower

The way it builds that prediction happens to make it a decent classifier of that word,
so actually both variants work out as characterizing the word.

The NN input implies [[one-hot]] coding, so it is not small, but it turns out to be moderately efficient{{verify}}.

<!--
https://en.wikipedia.org/wiki/Word2vec

https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html

https://pathmind.com/wiki/word2vec

https://www.youtube.com/watch?v=LSS_bos_TPI&vl=en

https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
-->
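A minimal training sketch, assuming the gensim implementation (gensim 4.x parameter names); the toy corpus here is far too small for meaningful vectors, it only shows the knobs:

<syntaxhighlight lang="python">
# Minimal word2vec training sketch (gensim 4.x parameter names).
# The toy corpus is far too small for meaningful vectors - real use wants
# millions of tokens - it only shows the moving parts.
from gensim.models import Word2Vec

corpus = [
    ["we", "saw", "the", "saw"],
    ["i", "sharpened", "the", "saw"],
    ["we", "saw", "a", "bunny"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context words considered on each side
    min_count=1,      # keep even rare words (only sensible on toy data)
    sg=0,             # 0 = continuous bag of words (cbow), 1 = continuous skip-gram
    epochs=50,
)

print(model.wv["saw"].shape)          # one dense vector per word, here (50,)
print(model.wv.most_similar("saw"))   # neighbours by cosine similarity
</syntaxhighlight>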


=====GloVe=====
<!--
Global Vectors for Word Representation

The GloVe paper itself compares itself with word2vec,
and concludes it consistently performs a little better.

See also:
* http://nlp.stanford.edu/projects/glove/
* J Pennington et al. (2014), "GloVe: Global Vectors for Word Representation"
-->




This is more for overview of my own than for teaching or exercise.



Intro

NLP data massage / putting meanings or numbers to words

Bag of words / bag of features

The bag-of-words model (more broadly the bag-of-features model) uses the collection of words in a context, unordered, in a multiset, a.k.a. bag.

In other words, we summarize a document (or part of it) by the appearance or count of words, and ignore things like adjacency and order - and so any grammar.
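A minimal sketch of that summarizing, using just the standard library (the whitespace split stands in for real tokenization):

<syntaxhighlight lang="python">
# Bag-of-words sketch: summarize text by word counts, ignoring order and grammar.
from collections import Counter

def bag_of_words(text):
    # crude lowercase/whitespace tokenization, standing in for a real tokenizer
    return Counter(text.lower().split())

print(bag_of_words("the saw saw the bunny"))
# Counter({'the': 2, 'saw': 2, 'bunny': 1})
</syntaxhighlight>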



In text processing

In introductions to Naive Bayes as used for spam filtering, its naivety is essentially this assumption that feature order does not matter.


Though real-world naive Bayes spam filtering would take more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words.

Other types of classifiers also make this assumption, or make it easy to do so.


Bag of features

While the idea is best known from text, hence bag-of-words, you can argue for a bag of features, applying it to anything you can count, which may be useful even when the features are considered independently.

For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.


In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.


See also:

N-gram notes

N-grams are contiguous sequences of length n.


They are most often seen in computational linguistics.


Applied to sequences of characters they can be useful e.g. in language identification, but the more common application is to words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words - for example statistical parsing, collocation analysis, text classification, sliding window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input This is a sentence . the 2-grams would be:

This   is
is   a
a   sentence
sentence   .


...though depending on how specially you do or do not want to treat the edges, people might fake some empty tokens at the edges, or some special start/end tokens.
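A sketch of that, for already-tokenized input:

<syntaxhighlight lang="python">
# n-gram sketch, for already-tokenized input.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["This", "is", "a", "sentence", "."]
print(ngrams(tokens, 2))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]

# If the edges should be treated specially, pad with start/end tokens first:
print(ngrams(["<s>"] + tokens + ["</s>"], 2)[0])   # ('<s>', 'This')
</syntaxhighlight>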


Skip-grams

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note: skip-grams now seem to refer to two different things.


An extension of n-grams where the components need not be consecutive (though they typically stay ordered).


A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.


They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.

They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
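A sketch of generating them (note that definitions vary a little between papers; this follows the "at distance at most k" reading above):

<syntaxhighlight lang="python">
# k-skip-n-gram sketch: like n-grams, but successive components may skip
# up to k intervening tokens.
from itertools import combinations

def skip_grams(tokens, n, k):
    result = []
    for idxs in combinations(range(len(tokens)), n):   # ordered index tuples
        if all(b - a <= k + 1 for a, b in zip(idxs, idxs[1:])):
            result.append(tuple(tokens[i] for i in idxs))
    return result

tokens = ["we", "saw", "a", "bunny"]
print(skip_grams(tokens, 2, 0))   # plain bigrams
print(skip_grams(tokens, 2, 1))   # bigrams that may also skip one token
</syntaxhighlight>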


Skip-grams apparently come from speech analysis, processing phonemes.


In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.

Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis.



Syntactic n-grams

Flexgrams

Words as features - one-hot coding and such

Putting numbers to words

Collocations

Collocations are statistically idiosyncratic sequences - the math that is often used asks "do these adjacent words occur together more often than the occurrence of each individually would suggest?".

This doesn't ascribe any meaning; it just tends to signal anything from empty habitual etiquette, jargon, various substituted phrases, and many other things that go beyond purely compositional construction - because why, other than common sentence structures, would they co-occur so often?
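One common way of putting a number to that question is (pointwise) mutual information over bigram counts - a sketch, with a toy corpus just to show the mechanics:

<syntaxhighlight lang="python">
# Collocation scoring sketch: pointwise mutual information (PMI) over bigrams,
# i.e. "does this pair co-occur more than the individual word frequencies suggest?"
import math
from collections import Counter

tokens = "we saw the saw and then we sharpened the saw".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    p_w1 = unigrams[w1] / N
    p_w2 = unigrams[w2] / N
    p_pair = bigrams[(w1, w2)] / (N - 1)
    return math.log2(p_pair / (p_w1 * p_w2))

# Higher PMI = co-occurring more than chance would suggest.
for (w1, w2), count in bigrams.most_common():
    print(w1, w2, round(pmi(w1, w2), 2))
</syntaxhighlight>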

...actually, there are varied takes on how useful collocations are, and why.


Semantic similarity

vector space representations, word embeddings

Word embeddings

Contextual word embeddings
Subword embeddings
Bloom embeddings


Moderately specific ideas and calculations

latent semantic analysis

Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.
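A small sketch of that idea with numpy: SVD of a term-document count matrix, then comparing documents in the reduced space (the tiny matrix is only for illustration):

<syntaxhighlight lang="python">
# LSA sketch: SVD of a term-document count matrix, then compare documents
# in the reduced ("latent") space.
import numpy as np

#                   doc0 doc1 doc2
counts = np.array([[2,   1,   0],    # car
                   [1,   2,   0],    # road
                   [1,   1,   0],    # driving
                   [0,   0,   2],    # bunny
                   [0,   0,   1]],   # carrot
                  dtype=float)

U, s, Vt = np.linalg.svd(counts, full_matrices=False)

k = 2                                        # keep the k strongest latent dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # one k-dimensional vector per document

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(doc_vectors[0], doc_vectors[1]))   # high: both are car-and-road documents
print(cosine(doc_vectors[0], doc_vectors[2]))   # low: different latent topic
</syntaxhighlight>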


random indexing

https://en.wikipedia.org/wiki/Random_indexing


Topic modeling

Roughly the idea: given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.

Assuming such documents share topics, you can probably find groups of words that belong to those topics.

Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document, so it acts like a soft/fuzzy clustering.

This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires zero training, it works better than you might expect when those assumptions are met (the largest probably being that your documents have singular topics).
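A minimal sketch using scikit-learn's LDA implementation (assuming a reasonably recent scikit-learn); the toy corpus is only there to show the mechanics:

<syntaxhighlight lang="python">
# Topic modeling sketch: LDA over bag-of-words counts, using scikit-learn.
# The toy corpus is far too small for good topics; it only shows the mechanics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the car drove down the road",
    "driving a car on the open road",
    "the bunny ate a carrot",
    "a hungry bunny nibbled the carrot",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)              # bag-of-words counts per document

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)               # per-document topic mixture

words = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[-3:][::-1]]
    print("topic", t, top_words)

print(doc_topics.round(2))                           # soft assignment, rows sum to ~1
</syntaxhighlight>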


https://en.wikipedia.org/wiki/Topic_model


[[Category:Math on data]]