Data modeling, restructuring, and massaging

{{#addbodyclass:tag_math}}
{{Math notes}}


* count


One limitation is sparsity:
including all words makes a humongous table, and most cells will contain zero.
Yet not including them means we say they do not exist ''at all''.


Another limitation is that counts ''as such'' are not directly usable,
for simple and dumb reasons: if more count means more important,
then longer documents will seem more important just because they are longer,
and 'the' and 'a' will be the most important words of all.




Yes, '''something like [[tf-idf]]'''[https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer] helps - an extra step on top of the previous that, e.g.
* will downweigh 'the' by merit of being absolutely everywhere
* reduces the effect of document length
...but there are various known limitations with ''that''. There are many ideas like it that do a little better.
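As a rough sketch of that extra step - scikit-learn's vectorizers doing the counting and the tf-idf reweighing on a few made-up documents:

<syntaxhighlight lang="python">
# A minimal sketch: raw counts versus tf-idf on the same (made-up) documents.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs and fish",
]

counts = CountVectorizer().fit_transform(docs)   # raw counts: sparse, and 'the' dominates
tfidf = TfidfVectorizer().fit_transform(docs)    # downweighs ubiquitous words, normalizes for length

print(counts.toarray())
print(tfidf.toarray().round(2))
</syntaxhighlight>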




Another limitation is that the table's other axis being ''all documents'' seems huge for no good reason.
There are plenty of tasks where those original training documents (which just happened to be the training data long ago) are just... not useful.


Sure there is information in those documents, but can't we learn from it, then put it in a more compact form?




Another limitation is that these still focus on unique words.
Even words that are inflections of others will be treated as entirely independent,
so if one document uses only 'fish' and 'spoon', it might even be treated as entirely distinct from one that says only 'fishing' and 'spooning'.
Which is arguably a good thing - spooning is less related to spoon than fishing is to fish, and that might well be learned.
But there is a mass of words for which this is just a bulk of extra work for very little bonus, or even just extra noise. And how do you tell the useful and non-useful cases apart?




This suggests one thing many have tried: matrix methods - factorization, dimensionality reduction, and the like.
You don't need to know how the math works, but the point is that similarly expressed terms can be recognized as such,
and e.g. 'fish' and 'fishing' will smush into each other, automatically, because otherwise-similar documents will mention both
{{comment|(looking from the perspective of the other axis, maybe we can also compare documents better, but unless comparing documents was your goal, that was actually something we were trying to get rid of)}}.
The reason that happens ''can'' however make this crude.
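A toy sketch of that factorization idea: a small hand-made term-document count matrix, an SVD, and term comparisons in the reduced space (the matrix and its counts are invented purely for illustration):

<syntaxhighlight lang="python">
# Factorize a tiny term-document count matrix and compare terms
# in the reduced (rank-2) space; the counts are made up.
import numpy as np

terms = ["fish", "fishing", "spoon"]
X = np.array([
    [2, 3, 0, 0],   # 'fish'    per document
    [1, 2, 0, 1],   # 'fishing' per document
    [0, 0, 3, 2],   # 'spoon'   per document
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vecs = U[:, :2] * s[:2]          # rank-2 representation of each term

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(term_vecs[0], term_vecs[1]))   # fish vs fishing: relatively high
print(cos(term_vecs[0], term_vecs[2]))   # fish vs spoon:   much lower
</syntaxhighlight>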






'''"Word embeddings"''' often refer to doing that, ''but more so'': we try to some somehow figure out a dense, semantically useful vector.
There are some older distributional similarity approaches - these were a little clunky in that
they made for high dimensional, sparse vectors, where each one represented a specific context.
They were sometimes more explainable, but somewhat unwieldy.
 
 
 
'''"Word embeddings"''' often refer the word vector thing ''but more compactly'' somehow:
we try to some somehow figure out a dense, semantically useful vector.


: ''Dense'' compared to the input: when you can see text as 'one of ~100K possible words', then a vector of maybe two hundred dimensions that does a good job of packing in enough meaning for good (dis)similarity comparisons is pretty good
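A sketch of what that density buys you: any two distinct words are equally dissimilar as one-hot vectors, while dense vectors allow graded similarity. The dense vectors below are random stand-ins, not trained embeddings:

<syntaxhighlight lang="python">
# One-hot versus dense word vectors; the dense ones are random stand-ins.
import numpy as np

vocab = {"fish": 0, "fishing": 1, "spoon": 2}    # imagine ~100K entries
V, D = len(vocab), 200

def one_hot(word):
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

rng = np.random.default_rng(0)
dense = {w: rng.normal(size=D) for w in vocab}   # stand-ins for learned vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot("fish"), one_hot("fishing")))   # always 0.0 for distinct words
print(cosine(dense["fish"], dense["fishing"]))       # graded; high for related words in a trained model
</syntaxhighlight>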


There are varied ways to do this.


You ''could'' start with well-annotated data, and that might be of higher quality in the end,







Intro

NLP data massage / putting meanings or numbers to words

Bag of words / bag of features

The bag-of-words model (more broadly the bag-of-features model) uses the collection of words in a context, unordered, in a multiset, a.k.a. bag.

In other words, we summarize a document (or part of it) by the appearance or count of words, and ignore things like adjacency and order - and so any grammar.



In text processing

In introductions to Naive Bayes as used for spam filtering, its naivety is essentially this assumption that feature order does not matter.


Though real-world naive Bayes spam filtering would take more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words.

Other types of classifiers also make this assumption, or make it easy to do so.
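For instance, a minimal sketch of that bag-of-words-plus-naive-Bayes setup in scikit-learn; the tiny 'spam'/'ham' examples are made up:

<syntaxhighlight lang="python">
# Bag-of-words features (single words, order ignored) feeding a naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win money now", "cheap pills win", "meeting at noon", "lunch at noon?"]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["win cheap money"]))   # -> ['spam'] on this toy data
</syntaxhighlight>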


Bag of features

While the idea is best known from text, hence bag-of-words, you can argue for a more general bag of features - applying it to anything you can count - which may be useful even when each feature is considered independently.

For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass" because each task may be easy enough individually, and the combination tends to narrow down what kind of photo it is.


In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.


See also:

N-gram notes

N-grams are contiguous sequences of length n.


They are most often seen in computational linguistics.


Applied to sequences of characters they can be useful e.g. in language identification, but the more common application is to sequences of words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding window methods (e.g. a sliding window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input This is a sentence . the 2-grams would be:

This   is
is   a
a   sentence
sentence   .


...though depending on how special you do or do not want to treat the edges, people might fake some empty tokens at the edge, or some special start/end tokens.
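A small sketch of extracting word n-grams from already-tokenized input, with the optional fake edge tokens just mentioned (the '&lt;s&gt;' padding token is an arbitrary choice):

<syntaxhighlight lang="python">
# Word n-grams from a token list, optionally padded at the edges.
def ngrams(tokens, n, pad=None):
    if pad is not None:
        tokens = [pad] * (n - 1) + list(tokens) + [pad] * (n - 1)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["This", "is", "a", "sentence", "."], 2))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
print(ngrams(["This", "is", "a", "sentence", "."], 2, pad="<s>"))
# additionally has ('<s>', 'This') and ('.', '<s>') at the edges
</syntaxhighlight>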


Skip-grams

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note: skip-grams now seem to refer to two different things.


An extension of n-grams where components need not be consecutive (though typically stay ordered).


A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.


They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.

They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
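A sketch of generating k-skip-n-grams, reading 'at distance at most k' as 'skip at most k tokens between consecutive components' (definitions vary a little between authors):

<syntaxhighlight lang="python">
# k-skip-n-grams: ordered length-n selections where consecutive picks skip at most k tokens.
from itertools import combinations

def skipgrams(tokens, n, k):
    out = []
    for idxs in combinations(range(len(tokens)), n):
        if all(j - i <= k + 1 for i, j in zip(idxs, idxs[1:])):
            out.append(tuple(tokens[i] for i in idxs))
    return out

print(skipgrams(["insurgents", "killed", "in", "ongoing", "fighting"], 2, 2))
# includes plain bigrams like ('insurgents', 'killed') as well as
# skipped ones like ('insurgents', 'in') and ('insurgents', 'ongoing')
</syntaxhighlight>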


Skip-grams apparently come from speech analysis, processing phonemes.


In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.

Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis.



Syntactic n-grams

Flexgrams

Words as features - one-hot coding and such

Putting numbers to words

Computers and people and numbers

vector space representations, word embeddings, and more

Just count in a big table
Word embeddings
Static embeddings
Contextual word embeddings
Subword embeddings
The hashing trick (also, Bloom embeddings)
Now we have nicer numbers, but how do I use them?
vectors - unsorted

Moderately specific ideas and calculations

Collocations

Collocations are statistically idiosyncratic sequences - the math that is often used asks "do these adjacent words occur together more often than the occurrence of each individually would suggest?" - though ideally it is a little more refined than that.

This doesn't ascribe any meaning or comparability; it just tends to signal anything from jargon, various substituted phrases, and empty habitual etiquette, to many other things that go beyond purely compositional construction - because why, other than common sentence structures, would they co-occur so often?

...actually, there are varied takes on how useful collocations are, and why.
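A sketch of that 'more often than chance' scoring using NLTK's collocation helpers; the one-line corpus is a stand-in, in practice you would feed it a long token list:

<syntaxhighlight lang="python">
# Score adjacent word pairs by PMI; pairs that co-occur more often than
# their individual frequencies would suggest float to the top.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = "he took a strong cup of tea and a strong cup of coffee".split()  # stand-in corpus

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)                # ignore pairs seen only once
print(finder.nbest(measures.pmi, 5))       # the pairs that stick together in this toy corpus
</syntaxhighlight>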

latent semantic analysis

Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.
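A minimal LSA sketch in scikit-learn: tf-idf followed by a truncated SVD, giving low-rank document vectors you can compare or search against. The documents are made up, and real use tends toward a few hundred components rather than two:

<syntaxhighlight lang="python">
# tf-idf then truncated SVD: the classic LSA pipeline, on made-up documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "fish and fishing gear",
    "fishing rods and fish hooks",
    "spoons and forks and knives",
]

X = TfidfVectorizer().fit_transform(docs)     # documents x terms, sparse
lsa = TruncatedSVD(n_components=2).fit(X)
doc_vecs = lsa.transform(X)                   # dense, low-rank document vectors
print(doc_vecs.round(2))
</syntaxhighlight>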


random indexing

https://en.wikipedia.org/wiki/Random_indexing


Topic modeling

Roughly the idea: given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.

Assuming a set of such documents sharing topics, you can probably find groups of words that belong to those topics.

Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document - so it acts like a soft/fuzzy clustering.

This is a relatively weak proposal in that it relies on a number of assumptions (the largest probably being that your documents have singular topics), but given that it requires zero training, it works better than you might expect when those assumptions are met.


https://en.wikipedia.org/wiki/Topic_model
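A sketch of that soft clustering with LDA in scikit-learn: word counts in, topic-word weights and per-document topic mixtures out. The four documents and the choice of two topics are made up, and real use needs far more documents:

<syntaxhighlight lang="python">
# Latent Dirichlet Allocation on a tiny made-up corpus with two assumed topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "goal striker football match referee",
    "striker scored in the football match",
    "stock market prices fell sharply",
    "market traders watched stock prices",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for topic in lda.components_:                 # one row of word weights per topic
    top = topic.argsort()[-4:][::-1]
    print([terms[i] for i in top])
print(lda.transform(X).round(2))              # per-document topic mixture
</syntaxhighlight>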


word2vec

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

word2vec is one of many ways to put semantic vectors to words (in the distributional hypothesis approach), and refers to two techniques, using either bag-of-words or skip-grams as processing for a specific learner, as described in T. Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space", probably the paper that kicked the dense-vector idea into wider interest.


Word2vec could be seen as building a classifier that predicts what word appears in a context, and/or what context appears around a word - which happens to do a decent job of characterizing that word.


That paper mentions

  • its continuous bag of words (cbow) variant predicts the current word based on the words directly around it (ignoring order, hence bow(verify))
  • its continuous skip-gram variant predicts surrounding words given the current word.
Uses skip-grams as a concept/building block. Some people refer to this technique as just 'skip-gram', without the 'continuous' - but that may come from not really reading the paper you're copy-pasting the image from.
Seems to be better at less-common words, but slower.


(NN implies one-hot coding, so not small, but it turns out to be moderately efficient(verify))
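A sketch of training word2vec with gensim; the three sentences are a made-up stand-in for a real tokenized corpus. sg=1 selects the (continuous) skip-gram variant, sg=0 the CBOW variant described above:

<syntaxhighlight lang="python">
# Train a small word2vec model with gensim (4.x parameter names).
from gensim.models import Word2Vec

sentences = [
    ["fish", "swim", "in", "the", "river"],
    ["people", "go", "fishing", "at", "the", "river"],
    ["soup", "is", "eaten", "with", "a", "spoon"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("river", topn=3))   # nearest words in the learned vector space
</syntaxhighlight>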


GloVe