Varied text processing
Type-token ratio
In analysis of text, token refers to each individual word, and type to each unique word.
The type-token ratio is the number of types divided by the number of tokens. It is a crude measure of lexical complexity in a text.
For example, if a 1000-word text has 250 unique words, it has a type-token ratio of 250/1000 = 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which may repeat phrases verbatim to reduce ambiguity rather than rely on heavily referential sentence structure.
You may wish to apply stemming or lemmatization before calculating a type-token ratio, if you want to avoid counting simple inflected variations as separate types.
Type-token ratios of different-sized texts are not directly comparable,
as word frequencies in text usually follow a power-law type of distribution,
implying that longer texts will almost necessarily show a larger increase in tokens than in types.
One semi-brute way around this is to calculate type-token figures for same-sized chunks of the text, then average those ratios into a single figure (the best chunk size is not a clear-cut decision, and it still doesn't work too well if some documents are much shorter than the chunk size).
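A minimal sketch of that chunked approach in Python; the chunk size and the choice to drop the trailing partial chunk are illustrative assumptions, not the only reasonable options.

    def type_token_ratio(tokens):
        """Unique words (types) divided by total words (tokens)."""
        return len(set(tokens)) / len(tokens)

    def chunked_type_token_ratio(tokens, chunk_size=1000):
        """Average the TTR over same-sized chunks, to reduce the text-length effect.
        The trailing partial chunk is dropped here (a choice, not a requirement)."""
        chunks = [tokens[i:i + chunk_size]
                  for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
        if not chunks:   # text shorter than one chunk: fall back to the plain ratio
            return type_token_ratio(tokens)
        return sum(type_token_ratio(c) for c in chunks) / len(chunks)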
Type-token ratio is often one among various statistical figures in text/corpus summaries,
because it's a cheap, rough heuristic for a number of things.
Consider plagiarism detection: if text was copied and reworded only slightly, it will have a comparable type-token ratio.
Supporting ideas
TF-IDF
Basics
tf-idf upsides and limitations
Scoring in search
Augmentations
Bag of words / bag of features
The bag-of-words model (more broadly the bag-of-features model) uses the collection of words in a context, unordered, as a multiset, a.k.a. bag.
In other words, we summarize a document (or part of it) by the presence or count of its words, and ignore things like adjacency and order - and so any grammar.
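A minimal sketch in Python; the example sentence is made up, and a Counter serves as the multiset/bag.

    from collections import Counter

    text = "the cat sat on the mat because the mat was warm"
    bag = Counter(text.split())   # a multiset of tokens; order and adjacency are discarded
    # Counter({'the': 3, 'mat': 2, 'cat': 1, 'sat': 1, 'on': 1, 'because': 1, 'was': 1, 'warm': 1})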
In text processing
In introductions to Naive Bayes as used for spam filtering, its naivety essentially is the assumption that features are independent of each other - which, for word features, also means their order does not matter.
Though real-world naive Bayes spam filtering would take more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which basically is bag of words.
Other types of classifiers also make this assumption, or make it easy to do so.
Bag of features
While the idea is best known from text, hence bag-of-words, you can generalize it to a bag of features: apply it to anything you can count, and it can be useful even when each feature is considered independently.
For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass", because each detection may be easy enough individually, and the combination tends to narrow down what kind of photo it is.
In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the sign fully), and it is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.
The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.
See also:
N-gram notes
N-grams are contiguous subsequences of length n.
They are most often seen in computational linguistics.
Applied to sequences of characters they can be useful e.g. in language identification,
but the more common application is to words.
As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding-window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.
For example, for the already-tokenized input This is a sentence . the 2-grams would be:
- This is
- is a
- a sentence
- sentence .
...though depending on how specially you do or do not want to treat the edges, people might pad with empty tokens or with special start/end tokens.
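A minimal sketch of that in Python; the optional pad token is an assumption covering the start/end treatment just mentioned.

    def ngrams(tokens, n=2, pad=None):
        """All contiguous n-grams; optionally pad both edges with a start/end token."""
        if pad is not None:
            tokens = [pad] * (n - 1) + list(tokens) + [pad] * (n - 1)
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    ngrams("This is a sentence .".split())
    # [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]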
Skip-grams
Note: skip-grams now seem to refer to two different things.
One is an extension of n-grams where components need not be consecutive (though they typically stay ordered).
A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.
They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.
They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
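A minimal sketch in Python, under one common reading where up to k tokens in total may be skipped within each n-gram (other definitions exist).

    from itertools import combinations

    def skipgrams(tokens, n=2, k=1):
        """n-grams that stay in order but may skip up to k tokens in total."""
        grams = set()
        for i in range(len(tokens)):
            # candidate positions for the remaining n-1 components
            window = range(i + 1, min(i + n + k, len(tokens)))
            for rest in combinations(window, n - 1):
                grams.add((tokens[i],) + tuple(tokens[j] for j in rest))
        return grams

    skipgrams("This is a sentence .".split(), n=2, k=1)
    # the plain bigrams, plus e.g. ('This', 'a') and ('is', 'sentence')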
Skip-grams apparently come from speech analysis, processing phonemes.
In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.
Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis - the skip-gram model used in word2vec (see the word2vec section below).
Syntactic n-grams
Flexgrams
Counting towards meaning
LSA, LSI
Latent Semantic Analysis is the application of Singular Value Decomposition to text analysis and search, primarily on words-per-document matrix sorts of data (with variants, as there always are).
It is frequently combined with the cosine metric for distances.
It's one of various things motivated by the distributional similarity observation (the idea that "a word is characterized by the company it keeps"), and as a result this analysis has some semantic value, in that it implicitly does a sort of collocation-style analysis and co-association,
because part of what you do by feeding it real-world text is telling it about similarity and contrast in that text.
Latent Semantic Indexing (LSI) isn't so much a variant of this as it is a use: using LSA's output to index the text documents the numbers/vectors came from. If you map queries to a vector in the same way, you can then do searches that are fuzzied in word choice.
Other uses include approximating the sense of unseen words, and assisting topic modelling.
It should be said that we now often either have better methods of doing that, or the methods we're using anyway can do these things as well.
Yet it's simpler to implement than most of them, so it's at least worth experimenting with to compare.
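A minimal LSA/LSI-style sketch, assuming scikit-learn; the toy corpus, the number of kept dimensions, and the query are made up for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
        "investors worry about falling markets",
    ]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)             # (weighted) words-per-document matrix

    lsa = TruncatedSVD(n_components=2)      # SVD, keeping only a few latent dimensions
    doc_vectors = lsa.fit_transform(X)

    # map a query into the same latent space, then compare with the cosine metric
    query = lsa.transform(vec.transform(["pets and cats"]))
    print(cosine_similarity(query, doc_vectors))   # the pet-related document should score highest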
Topic modelling
Very roughly, topic classification and topic modelling are about grouping documents via representative topics.
In the wider sense, this includes a variety of approaches.
Specific uses might focus on finding the most useful topics, attempting to classify a group of documents, or both.
Topic classification, if mentioned at all, often refers to a semi-supervised technique, where you feed it the expected topics/classes.
A limitation of topic classification is that it is often still pre-set classification: if it turns out your topics are not precise enough, or way too messy, you need to refine them to your actual needs, and probably re-train.
Topic modelling often refers to a more automatic method, or the technique as a whole.
So more commonly we talk about topic modelling, an often barely-supervised method applied to a collection of text documents, detecting words and phrases that seem to characterize or distinguish documents.
"Often", because implementations might still be as simple as a rule-based expert system or little more than some tf-idf trickery - yet these days the term points primarily at techniques that need little supervision on the machine learning side, either trained with examples (not necessarily deep learning, because while that works well it tends to require significantly more training data than other ML, which can defeat the purpose of automated discovery) or even without, e.g. trying to find similar items via latent features.
A 'topic model' could refer just to an explicitly defined set of topics, but perhaps more usually refers to topic modelling as a whole.
What it is useful for:
- sorting new documents together with similar ones
- finding interesting phrases
What it won't do for you:
This involves various methods, including LDA and NMF (both mentioned below).
In text analysis and linguistics you will see the concepts of latent semantics and distributional similarity, with various methods using SVD (the wider idea behind LSA), though neural-net approaches are now also common.
LDA sorts tokens into a number of as-yet-unknown (latent) topics, and makes some modelling assumptions that lead to documents having a modest set of topics, and topics a modest set of tokens.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9040385/
A document can be associated with more than one topic.
Various methods yield a vector in a space that represents topics, where you may select the most significant -- which sounds a lot like soft clustering and fuzzy clustering, or LSA.
Its uses include
- giving a quick summary of the topics a document contains - possibly visually
- finding similar items,
- describing the similarity of items
- anything else that can benefit from such latent grouping
...applied to anything from documents and genomic samples to computer vision (consider e.g. augmenting object detection).
Keep in mind that methods may introduce implicit choices that impose an upper limit on how well they suit each task.
See also
Topic modelling evaluation often judges topic coherence - a metric of how well two or more terms belong together.
Coherence metrics can be split into intrinsic (working just on the dataset itself) and extrinsic (comparing results against an external reference corpus).
Apparently the currently popular metrics are
- UMass's (intrinsic)
- UCI's (extrinsic; based on PMI, asking how likely it is that words co-occur)
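A rough sketch of the UMass-style computation for a single topic, in Python; it assumes the topic's top words all occur somewhere in the dataset (so the denominator is never zero), and uses the usual +1 smoothing.

    import math

    def umass_coherence(top_words, documents):
        """top_words: a topic's top terms, most probable first.
        documents: list of token lists; co-occurrence is counted per document (intrinsic)."""
        doc_sets = [set(d) for d in documents]
        def doc_count(*words):
            return sum(1 for s in doc_sets if all(w in s for w in words))
        score = 0.0
        for i in range(1, len(top_words)):
            for j in range(i):
                # +1 smoothing so an unseen pair doesn't take log(0)
                score += math.log((doc_count(top_words[i], top_words[j]) + 1)
                                  / doc_count(top_words[j]))
        return score   # less negative means more coherent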
http://qpleple.com/topic-coherence-to-evaluate-topic-models/
https://monkeylearn.com/blog/introduction-to-topic-modeling/
So what's the difference between clustering and topic modeling?
It's more of a difference in aim - topic modelling also tries to discover latent topics, whereas clustering only groups documents by overall similarity.
Yet depending on your choice of algorithms, there isn't necessarily a lot of practical difference - both, for example, share a "it works better if you ask it for a data-appropriate number of groups". A document can be assigned to more than one topic, but soft/fuzzy clustering allows that as well.
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Methods include Non-Negative Matrix Factorization (NMF or NNMF)
Topics
Think about topic modeling as a layperson asked to summarize by picking out interesting terms - because this also suggests the limitations.
Extracting topics from articles, from conversations, or from other types of documents amounts to somewhat different tasks,
which differ in assumptions like whether the topics come from a known set or not, whether they come from a specific field, how much they might change, and how many there may be.
Because if you have a fixed set of topics you want to detect within more formal language, learning the phrases associated with each is a relatively simple classification task,
...whereas detecting open-ended topics in less constrained text is harder.
For example, consider this fragment (translated from Dutch) and the kind of tags one might want assigned:
If there were indeed such uncontrolled growth, then inadequate funding of education would not be the underlying reason. The government is responsible for the regular funding, which enables schools to meet their legal obligations. That government funding is sufficient to provide education that meets the legal requirements.
- funding #government #education #legal #schools
Test summarization tasks on kamervragen (Dutch parliamentary questions), because they are already fairly specific to begin with.
LDA
Latent Dirichlet Allocation
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
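A minimal sketch, assuming scikit-learn's LatentDirichletAllocation; the toy corpus, the number of topics, and the number of top words shown are made-up choices.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the match ended in a penalty shootout",
        "the striker scored twice in the final",
        "parliament voted on the new budget",
        "the minister defended the budget cuts",
    ]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                      # LDA works on raw counts, not tf-idf

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)                # per-document topic mixture

    words = vec.get_feature_names_out()
    for t, weights in enumerate(lda.components_):
        top = [words[i] for i in weights.argsort()[-4:][::-1]]
        print("topic", t, ":", top)

NMF (mentioned earlier) can be dropped into much the same pipeline shape, typically on a tf-idf matrix instead of raw counts.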
Putting semantic numbers to words and to more
Computers and people and numbers
vector space representations, word embeddings, and more
Just count in a big table?
Word embeddings
Static embeddings
Contextual word embeddings
Subword embeddings
The hashing trick (also, Bloom embeddings)
Now we have nicer numbers, but how do I use them?
vectors - unsorted
Moderately specific ideas and calculations
Collocations
Collocations are statistically idiosyncratic sequences - the math that is often used asks "do these adjacent words occur together more often than the occurrence of each individually would suggest?" - though ideally it is a little more refined than that.
This doesn't ascribe any meaning or comparability; it just tends to signal anything from jargon, various substituted phrases, and empty habitual etiquette to many other things that go beyond purely compositional construction - because why else, other than common sentence structure, would they co-occur so often?
...actually, there are varied takes on how useful collocations are, and why.
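A rough sketch of that "more often than chance" question using pointwise mutual information over adjacent pairs; the toy text and the minimum-count filter are assumptions.

    import math
    from collections import Counter

    tokens = "new york is a big city and new york never sleeps".split()

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def pmi(pair):
        """How much more often the pair occurs than its parts' individual frequencies suggest."""
        w1, w2 = pair
        p_pair = bigrams[pair] / (total - 1)
        return math.log2(p_pair / ((unigrams[w1] / total) * (unigrams[w2] / total)))

    # only rank pairs seen more than once; ('new', 'york') should float to the top
    for pair, count in bigrams.items():
        if count > 1:
            print(pair, round(pmi(pair), 2))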
latent semantic analysis
Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.
random indexing
https://en.wikipedia.org/wiki/Random_indexing
Topic modeling
Roughly, the idea is that, given documents that are about a particular topic, one would expect particular words to appear in each more or less frequently.
Assuming such documents sharing topics, you can probably find groups of words that belong to those topics.
Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document, so it acts like soft/fuzzy clustering.
This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires zero training, it works better than you might expect when those assumptions are met (the largest probably being that your documents have singular topics).
https://en.wikipedia.org/wiki/Topic_model
word2vec
See *2vec#word2vec