Data modeling, restructuring, and massaging


This is more for overview of my own than for teaching or exercise.




Intro

NLP data massage / putting meanings or numbers to words

Bag of words / bag of features

The bag-of-words model (more broadly, the bag-of-features model) uses the collection of words in a context, unordered, as a multiset, a.k.a. bag.

In other words, we summarize a document (or part of it) by the appearance or count of its words, and ignore things like adjacency and order - and so any grammar.
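
A minimal sketch of that in Python (the whitespace tokenization is crude, and just for illustration):

 from collections import Counter

 def bag_of_words(text):
     # crude whitespace tokenization; real use would want a proper tokenizer
     return Counter(text.lower().split())

 print(bag_of_words("the cat sat on the mat"))
 # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})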



In text processing

In introductions to Naive Bayes as used for spam filtering, its naivety is essentially this assumption that feature order does not matter.


Though real-world naive Bayes spam filtering would take more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which is basically bag of words.

Other types of classifiers also make this assumption, or make it easy to do so.
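
For example, a toy spam/ham classifier along those lines, sketched with scikit-learn's CountVectorizer (1-grams, so effectively bag of words) and MultinomialNB - the texts and labels are made up purely for illustration:

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.naive_bayes import MultinomialNB

 texts  = ["win money now", "cheap pills win big", "meeting at noon", "lunch later?"]
 labels = ["spam", "spam", "ham", "ham"]

 vectorizer = CountVectorizer()              # counts of 1-grams, i.e. bag of words
 X = vectorizer.fit_transform(texts)         # document-term count matrix

 classifier = MultinomialNB().fit(X, labels)
 print(classifier.predict(vectorizer.transform(["win a cheap lunch"])))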


Bag of features

While the idea is best known from text, hence bag-of-words, you can generalize it to a bag of features: apply it to anything you can count, which may be useful even when each feature is considered independently.

For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.


In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.


See also:

N-gram notes

N-grams are contiguous sequences of length n.


They are most often seen in computational linguistics.


Applied to sequences of characters, n-grams can be useful, e.g. in language identification, but the more common application is to words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input This is a sentence . the 2-grams would be:

This   is
is   a
a   sentence
sentence   .


...though depending on how specially you do or do not want to treat the edges, people might fake some empty tokens at the edges, or some special start/end tokens.
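
A rough sketch of generating these in Python (the optional pad token is one way of faking tokens at the edges, as just mentioned):

 def ngrams(tokens, n=2, pad=None):
     # pad, if given, is a fake token added to both edges (e.g. '<s>')
     if pad is not None:
         tokens = [pad] * (n - 1) + list(tokens) + [pad] * (n - 1)
     return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

 print(ngrams(['This', 'is', 'a', 'sentence', '.']))
 # [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]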


Skip-grams

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note: Skip-grams now seem to refer to two different things.


An extension of n-grams where components need not be consecutive (though typically stay ordered).


A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.
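
A sketch of one reading of that definition in Python - here taking 'at distance at most k' to mean that at most k tokens are skipped between consecutive chosen components (definitions vary a bit between papers):

 from itertools import combinations

 def skipgrams(tokens, n=2, k=1):
     # choose n positions, keep their order, and allow at most k skipped
     # tokens between each consecutive pair of chosen positions
     results = []
     for idx in combinations(range(len(tokens)), n):
         if all(b - a <= k + 1 for a, b in zip(idx, idx[1:])):
             results.append(tuple(tokens[i] for i in idx))
     return results

 print(skipgrams(['This', 'is', 'a', 'sentence', '.'], n=2, k=1))
 # the plain 2-grams, plus pairs like ('This', 'a') and ('is', 'sentence')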


They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.

They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).


Skip-grams apparently come from speech analysis, processing phonemes.


In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.

Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis - the continuous skip-gram model used in word2vec (see below).



Syntactic n-grams

Flexgrams

Words as features - one-hot coding and such

Putting numbers to words

Collocations

Collocations are statistically idiosyncratic sequences - the math that is often used asks "do these adjacent words occur together more often than the occurrence of each individually would suggest?".

This doesn't ascribe any meaning; it just tends to signal anything from empty habitual etiquette to jargon, substituted phrases, and many other things that go beyond purely compositional construction - because what, other than common sentence structure, would make them co-occur so often?

...actually, there are varied takes on how useful collocations are, and why.
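
One common way of putting a number to 'more often than their individual occurrences would suggest' is pointwise mutual information (PMI) over adjacent pairs. A rough sketch in Python, without the smoothing or minimum-frequency filtering that real use would want (libraries like NLTK's collocations module and gensim's Phrases do this more carefully):

 import math
 from collections import Counter

 def pmi_collocations(tokens):
     word_counts   = Counter(tokens)
     bigram_counts = Counter(zip(tokens, tokens[1:]))
     total_words   = len(tokens)
     total_bigrams = max(len(tokens) - 1, 1)
     scores = {}
     for (w1, w2), count in bigram_counts.items():
         p_pair = count / total_bigrams
         p_w1   = word_counts[w1] / total_words
         p_w2   = word_counts[w2] / total_words
         scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
     # highest-scoring pairs co-occur more than their parts' frequencies suggest
     return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

 tokens = "new york is big , new york is busy".split()
 print(pmi_collocations(tokens)[:3])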

Word embeddings

Contextual word embeddings
Subword embeddings
Bloom embeddings

Could be either style

Semantic similarity

Moderately specific ideas and calculations

latent semantic analysis

Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition (SVD) to text analysis and search, typically applied to a term-document matrix.
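
A sketch of the usual pipeline using scikit-learn - TruncatedSVD applied to a TF-IDF term-document matrix; the tiny corpus and the choice of two components are arbitrary here:

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.decomposition import TruncatedSVD

 docs = ["the cat sat on the mat",
         "dogs and cats are pets",
         "stock markets fell sharply",
         "investors fear a market crash"]

 X = TfidfVectorizer().fit_transform(docs)    # sparse documents x terms matrix

 lsa = TruncatedSVD(n_components=2)           # keep 2 latent 'semantic' dimensions
 doc_vectors = lsa.fit_transform(X)           # dense, low-dimensional document vectors
 print(doc_vectors.shape)                     # (4, 2)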


random indexing

https://en.wikipedia.org/wiki/Random_indexing


Topic modeling

Roughly the idea is that, given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.

Assuming such documents share topics, you can probably find groups of words that belong to those topics.

Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document - so it acts like a soft/fuzzy clustering.

This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires no annotated training data, it works better than you might expect when those assumptions are met (the largest assumption probably being that your documents each have a single topic).


https://en.wikipedia.org/wiki/Topic_model
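
A sketch of that idea with scikit-learn's LatentDirichletAllocation (LDA, one common topic model); the toy corpus and the choice of two topics are just for illustration, and get_feature_names_out assumes a reasonably recent scikit-learn:

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.decomposition import LatentDirichletAllocation

 docs = ["the striker scored a late goal",
         "the keeper saved a penalty",
         "the central bank raised interest rates",
         "inflation and interest worry the markets"]

 vectorizer = CountVectorizer(stop_words='english')
 X = vectorizer.fit_transform(docs)

 lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

 # per-document topic mixture - the soft/fuzzy clustering mentioned above
 print(lda.transform(X).round(2))

 # the most strongly associated words per topic
 terms = vectorizer.get_feature_names_out()
 for topic, weights in enumerate(lda.components_):
     print(topic, [terms[i] for i in weights.argsort()[-4:]])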

vector space representations, word embeddings

word2vec

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

word2vec is one of many ways to put semantic vectors to words (in the distributional hypothesis approach), and refers to two techniques, using either continuous bag-of-words or continuous skip-gram as the processing for a specific learner, as described in T Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space".

That paper mentions

  • its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow(verify))
  • its continuous skip-gram variant predicts surrounding words given the current word.
    It uses skip-grams as a concept/building block. Some people refer to this technique as just 'skip-gram' without the 'continuous', but this may come from not really reading the paper you're copy-pasting the image from.
    It seems to be better at less-common words, but slower.

The way either variant builds its prediction happens to produce a decent characterization of the word itself, so both work out as characterizing the word.


The neural-net formulation implies one-hot coding of the input, so the input is not small, but it turns out to be moderately efficient(verify).
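
A minimal sketch using the gensim library (parameter names as in gensim 4.x); the toy corpus is far too small to learn anything meaningful, it mostly shows the switch between the two variants:

 from gensim.models import Word2Vec

 sentences = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]

 # sg=0 selects the cbow variant, sg=1 the continuous skip-gram variant
 model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

 print(model.wv["cat"].shape)           # (50,) - the word's dense vector
 print(model.wv.most_similar("cat"))    # nearest words by cosine similarity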


GloVe

GloVe (Global Vectors for Word Representation) is a comparable way of producing word vectors; its paper compares it against word2vec and concludes it consistently performs a little better.

See also:

  • http://nlp.stanford.edu/projects/glove/
  • J Pennington et al. (2014), "GloVe: Global Vectors for Word Representation"