Varied text processing




Type-token ratio

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

In the analysis of text, token refers to an individual word occurrence, and type to a unique word.

The type-token ratio is the number of types divided by the number of tokens. It is a crude measure of the lexical complexity of a text.

For example, if a 1000-word text has 250 unique words, it has a type-token ratio of 0.25, which is relatively low and suggests a simpler writing style -- or possibly legalese, which repeats phrases to reduce ambiguity rather than rely on heavily referential sentence structure.
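A minimal sketch of that calculation in Python (whitespace splitting stands in for a real tokenizer, and the normalization choices are yours to make):

 def type_token_ratio(text):
     # Extremely naive tokenization: lowercase and split on whitespace.
     # A real tokenizer (and possibly lemmatization, see below) would be better.
     tokens = text.lower().split()
     return len(set(tokens)) / len(tokens) if tokens else 0.0

 # a 1000-token text with 250 unique tokens would give 0.25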


You may wish to apply stemming or lemmatization before calculating a type-token ratio, when you want to avoid counting simple inflected variants of the same word as separate types.


Type-token ratios of different-sized texts are not directly comparable, as word frequencies in text usually follow a power-law type of distribution, implying that longer texts will almost necessarily show a larger increase in tokens than in types.

One semi-brute-force way around this is to calculate type-token ratios for same-sized chunks of the text, then average those ratios into a single figure (the best chunk size is not a clear-cut decision, and it still doesn't work well when some documents are much shorter than the chunk size).
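A sketch of that chunked averaging (it assumes you already have a token list; the chunk size of 500 is an arbitrary choice, and leftover tokens at the end are simply dropped here):

 def mean_chunked_ttr(tokens, chunk_size=500):
     # Compute a type-token ratio per consecutive same-sized chunk,
     # then average those ratios into a single figure.
     ratios = []
     for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
         chunk = tokens[start:start + chunk_size]
         ratios.append(len(set(chunk)) / len(chunk))
     return sum(ratios) / len(ratios) if ratios else None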


Type-token ratio is often one among various statistical figures/factors in text/corpus summaries, because it's a rough heuristic for some things.

Consider plagiarism detection: if text was copied and reworded only slightly, it will have a comparable type-token ratio.


Supporting ideas

TF-IDF

Basics

tf-idf upsides and limitations

Scoring in search

Augmentations

Bag of words / bag of features

The bag-of-words model (more broadly, the bag-of-features model) uses the collection of words in a context, unordered, in a multiset, a.k.a. a bag.

In other words, we summarize a document (or part of it) by the presence or count of its words, and ignore things like adjacency and order - and thereby any grammar.
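In code that can be as simple as counting words - a sketch using Python's collections.Counter, again with whitespace splitting standing in for real tokenization:

 from collections import Counter

 def bag_of_words(text):
     # The "bag": word -> count, with all order information discarded.
     return Counter(text.lower().split())

 print(bag_of_words("the cat sat on the mat"))
 # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})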



In text processing

In introductions to Naive Bayes as used for spam filtering, its naivety is essentially this assumption that feature order does not matter.


Though real-world naive Bayes spam filtering would use more complex features than single words (and may re-introduce adjacency via n-grams or such), examples often use 1-grams for simplicity - which is basically bag of words.

Other types of classifiers also make this assumption, or make it easy to do so.
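As a sketch of the textbook naive Bayes setup just mentioned, assuming scikit-learn is available (the toy texts and labels are made up; real spam filtering would need far more data and richer features):

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.naive_bayes import MultinomialNB

 # made-up toy data, purely for illustration
 texts = ["cheap pills buy now", "meeting moved to friday",
          "win money now", "lunch tomorrow?"]
 labels = ["spam", "ham", "spam", "ham"]

 vectorizer = CountVectorizer()              # 1-gram bag-of-words counts
 X = vectorizer.fit_transform(texts)

 classifier = MultinomialNB()
 classifier.fit(X, labels)

 print(classifier.predict(vectorizer.transform(["buy cheap pills"])))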


Bag of features

While the idea is best known from text, hence bag-of-words, the more general bag-of-features applies to anything you can count, and can be useful even when the features are considered independently.

For example, you may follow up object detection in an image with logic like "if this photo contains a person, and a dog, and grass", because each detection may be easy enough individually, and the combination tends to narrow down what kind of photo it is.


In practice, bag-of-features often refers to models that recognize parts of a whole object (e.g. "we detected a bunch of edges of road signs" might be easier and more robust than detecting the whole sign), and it is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea is that you can describe an image by the collection of small things recognized in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but is often secondary.


See also:

N-gram notes

N-grams are contiguous sequences of length n.


They are most often seen in computational linguistics.


Applied to sequences of characters they can be useful e.g. in language identification, but the more common application is to words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding-window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input This is a sentence . the 2-grams would be:

This   is
is   a
a   sentence
sentence   .


...though depending on how specially you do or do not want to treat the edges, people might fake some empty tokens at the edges, or add special start/end tokens.
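A small sketch of generating word n-grams from a token list, with optional padding tokens for the edges as just mentioned:

 def ngrams(tokens, n=2, pad_symbol=None):
     # Optionally pad both edges so the first and last tokens
     # also appear in n-1 edge n-grams.
     if pad_symbol is not None:
         tokens = [pad_symbol] * (n - 1) + list(tokens) + [pad_symbol] * (n - 1)
     return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

 print(ngrams(["This", "is", "a", "sentence", "."], n=2))
 # [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]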


Skip-grams

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note: skip-grams now seem to refer to two different things.


An extension of n-grams where components need not be consecutive (though they typically stay ordered).


A k-skip-n-gram is a length-n sequence where the components occur at distance at most k from each other.


They may be used to ignore stopwords, but perhaps more often they are intended to help reduce data sparsity, under a few assumptions.

They can help discover patterns on a larger scale, simply because skipping makes you look further for the same n. (also useful for things like fuzzy hashing).
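A sketch of generating them, following the definition above (at most k tokens skipped between adjacent components; definitions vary a little between papers):

 from itertools import product

 def k_skip_ngrams(tokens, n=2, k=1):
     # For each start position, try every combination of 0..k skipped
     # tokens between adjacent components, keeping the original order.
     results = []
     for start in range(len(tokens)):
         for gaps in product(range(k + 1), repeat=n - 1):
             indices = [start]
             for gap in gaps:
                 indices.append(indices[-1] + 1 + gap)
             if indices[-1] < len(tokens):
                 results.append(tuple(tokens[i] for i in indices))
     return results

 print(k_skip_ngrams(["This", "is", "a", "sentence"], n=2, k=1))
 # [('This', 'is'), ('This', 'a'), ('is', 'a'), ('is', 'sentence'), ('a', 'sentence')]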


Skip-grams apparently come from speech analysis, processing phonemes.


In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.

Actually, if you go looking, skip-grams are now often equated with a fairly specific analysis (the skip-gram training objective in word2vec).



Syntactic n-grams

Flexgrams

Counting towards meaning

LSA, LSI

Latent Semantic Analysis is the application of Singular Value Decomposition to text analysis and search, primarily to words-per-document matrix sort of data (with variants, as there always are).

Frequently combined with cosine metric for distances.


It's one of various things motivated by the distributional similarity observation (the idea that "a word is characterized by the company it keeps"). As a result, this analysis has some semantic value, in that it implicitly does a sort of collocation-style analysis and finds co-associations: part of what you do by feeding it real-world text is telling it about similarity and contrast in that text.


Latent Semantic Indexing (LSI) isn't so much a variant of this as it is a use: using LSA's output to index the text documents the numbers/vectors came from. If you map queries to a vector in the same way, you can then do searches that are fuzzied in word choice.
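A sketch of that idea, assuming scikit-learn is available (TruncatedSVD applied to a tf-idf matrix is a common way of doing LSA/LSI; the toy documents are made up):

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.decomposition import TruncatedSVD
 from sklearn.metrics.pairwise import cosine_similarity

 docs = ["the cat sat on the mat",
         "dogs and cats are pets",
         "stock markets fell sharply today"]   # made-up toy corpus

 vectorizer = TfidfVectorizer()
 X = vectorizer.fit_transform(docs)            # tf-idf weighted term-document matrix

 svd = TruncatedSVD(n_components=2)            # the "latent" dimensions
 doc_vectors = svd.fit_transform(X)

 # map a query into the same latent space, then rank documents by cosine similarity
 query_vector = svd.transform(vectorizer.transform(["my dog is a pet"]))
 print(cosine_similarity(query_vector, doc_vectors))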


Other uses include approximating the sense of unseen words, and assisting topic modelling.

It should be said that we now often either have better methods of doing that, or the methods we're using anyway can do these things as well.


Yet it's simpler to implement than most of them, so it's at least worth experimenting with to compare.


Topic modelling

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Very roughly, topic classification and topic modelling are about grouping documents via representative topics.


In the wider sense, this includes a variety of approaches.

Specific uses might focus on finding the most useful topics, attempting to classify a group of documents, or both.


Topic classification, if mentioned at all, often refers to a semi-supervised technique, where you feed it the expected topics/classes.

A limitation of topic classification is that it is often still pre-set classification: if it turns out your topics are not precise enough, or way too messy, you need to refine them to your actual needs, and probably re-train.


Topic modelling often refers to a more automatic method, or the technique as a whole.

So more commonly we talk about topic modeling, an often barely-supervised method applied to a collection of text documents, detecting words and phrases that seem to characterize or distinguish documents.

"Often", because implementations might still be as simple as a rule-based expert system or little more than some tf-idf trickery - yet recently the term points primarily at techniques that barely need supervision machine learning side, either trained with examples (not necessarily deep learning, because while that works well it tends to require significantly more training data than other ML, which can defeat the purpose of automated discovery)) and even without, e.g. trying to find similar items via latent features.


A 'topic model' could refer just to an explicitly defined set of topics, but perhaps more usually refers to topic modelling as a whole.


What it is useful for:

  • sorting new documents together with similar ones
  • finding interesting phrases

What it won't do for you:


This involves methods including LDA and NMF.


In text analysis and linguistics you will see the concepts of latent semantics and distributional similarity, with various methods using SVD (within the wider idea of LSA), though neural-net approaches are now also common.


LDA sorts tokens into an as-yet-unknown number of topics, and makes some modelling assumptions that lead to documents having a modest set of topics, and topics a modest set of tokens.
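A minimal sketch with scikit-learn's LatentDirichletAllocation, assuming scikit-learn is available (the corpus and the number of topics are arbitrary choices here):

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.decomposition import LatentDirichletAllocation

 docs = ["the league match ended in a draw",
         "the striker scored twice",
         "parliament passed the budget",
         "the minister answered questions"]    # made-up toy corpus

 vectorizer = CountVectorizer()
 counts = vectorizer.fit_transform(docs)       # LDA works on raw term counts

 lda = LatentDirichletAllocation(n_components=2, random_state=0)
 doc_topics = lda.fit_transform(counts)        # per-document topic mixture

 # top words per topic
 words = vectorizer.get_feature_names_out()
 for topic_index, weights in enumerate(lda.components_):
     print(topic_index, [words[i] for i in weights.argsort()[-3:]])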


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9040385/


A document can be associated with more than one topic.

Various methods yield a vector in a space that represents topics, from which you may select the most significant -- which sounds a lot like soft clustering and fuzzy clustering, or LSA.


Its uses include:

  • giving a quick summary of the topics contained - possibly visual
  • finding similar items
  • describing the similarity of items
  • anything that can benefit from such latent descriptions

...and this applies to anything from documents and genomic samples to computer vision (consider e.g. augmenting object detection).


Keep in mind that methods may introduce implicit choices that impose an upper limit on how well they suit each task.


See also



Topic models are often judged via topic coherence - a metric of how well two or more terms belong together.


This can be split into intrinsic metrics (working just on the dataset) and extrinsic ones (judging results against an external reference corpus).

Apparently the currently popular metrics are:

  • UMass coherence - intrinsic
  • UCI coherence - based on PMI, asking how likely it is that words co-occur; extrinsic
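As an illustration of the intrinsic flavour, a hand-rolled UMass-style score based on document co-occurrence counts (with +1 smoothing; real implementations differ in details like word ordering and normalization):

 import math

 def umass_coherence(topic_words, documents):
     # topic_words: top words of one topic, most to least probable;
     # documents: list of token lists. Assumes each topic word occurs
     # in at least one document (true when the words come from the corpus).
     doc_sets = [set(doc) for doc in documents]
     def doc_freq(*words):
         return sum(1 for d in doc_sets if all(w in d for w in words))
     score = 0.0
     for i in range(1, len(topic_words)):
         for j in range(i):
             w_i, w_j = topic_words[i], topic_words[j]
             score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
     return score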


http://qpleple.com/topic-coherence-to-evaluate-topic-models/


https://monkeylearn.com/blog/introduction-to-topic-modeling/



So what's the difference between clustering and topic modeling?

It's more of a difference in the aim you have - topic modeling also tries to discover and characterize latent topics, whereas clustering only groups similar documents.

Yet depending on your choice of algorithms, there isn't necessarily a lot of practical difference - both e.g. share an "it works better if you ask it for a data-appropriate number of groups" behaviour. A document can be assigned to more than one topic, but soft clustering allows that as well.



https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation


Methods include Non-Negative Matrix Factorization (NMF or NNMF)





Topics


Think about topic modeling as a layperson asked to summarize by picking out interesting terms - because this also suggests the limitations.


Finding topics in articles, in conversations, or in other types of documents amounts to somewhat different tasks, with differing assumptions about whether the topics come from a known set, whether they come from a specific field, how much they might change, and how many there may be.


If you have a fixed set of topics you want to detect within fairly formal language, then learning the phrases associated with each is a relatively simple classification task.

...whereas detecting the topic of arbitrary, open-ended text is a much fuzzier problem.



this is an example:


If there were indeed such a proliferation, inadequate funding of education would not be the underlying reason. The government is responsible for the regular funding, which enables schools to meet their legal obligations. Government funding is sufficient to provide education that meets the legal requirements.

  1. funding #government #education #legal #schools




Test summarization tasks on kamervragen (Dutch parliamentary questions), because they are already fairly specific to begin with









LDA

Latent Dirichlet Allocation

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation


Putting semantic numbers to words and to more

Computers and people and numbers

vector space representations, word embeddings, and more

Just count in a big table?
Word embeddings
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The above arguments have been pushing us towards an area where those vectors no longer represent just the words, but also contain some comparability to similar words.

You can manage to do that in many distinct ways. There are some older distributional similarity approaches - these were a little clunky in that they made for high-dimensional, sparse vectors, where each dimension represented a specific context. They sometimes gave more explainable results, but they were unwieldy to work with (and not very efficient either, and also more easily distracted).


"Word embeddings" often refer the word vectors but more compactly somehow: we try to somehow figure out a smaller/denser semantically useful vector.

Dense compared to the input: when you see text as 'a series of one of ~100K possible words', then a vector of maybe two hundred dimensions already seems to do a good job of packing in enough meaning to e.g. give reasonable (dis)similarity for a good amount of the things we test for.
Semantically useful in that those properties tend to be useful, often for such (dis)similarity comparisons - e.g. 'bat' may get a sense of animal, tool, verb, and noun -- or rather, in this space it appears closer to animals, and tools, and verbs, than e.g. 'banana' does.


"This amounts to even more up-front work, so why do this dense semantic thing?"

One reason is practical - classical methods run into some age-old machine learning problems, like the fact that the more dimensions you add, the more you run into issues like the curse of dimensionality and sparsity. These happen to be issues that various word embedding methods sidestep somewhat. By cheating, yes, but by cheating pretty well.


Also, putting all words in a single space lets us compare terms, sentences, and documents. If the vectors are good (for the scope you have), this can be a good approximation of semantic similarity.
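Such comparisons usually come down to cosine similarity between the vectors - a quick sketch with numpy, where the random vectors are only stand-ins for real embeddings:

 import numpy as np

 def cosine_similarity(a, b):
     # cosine of the angle between two vectors: near 1 means pointing the
     # same way, near 0 means (in typical embedding spaces) unrelated
     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

 # stand-in vectors, not real embeddings
 cat, dog, car = np.random.rand(3, 200)
 print(cosine_similarity(cat, dog), cosine_similarity(cat, car))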

If we can agree on a basis between uses (e.g. build a reasonable set of vectors per natural language) we might even be able to give a basic idea of e.g. what any unseen new document is about.

(The last turns out to be optimistic, for a slew of reasons. You can often get better results out of becoming a little domain-specific.)



"So how do you get these dense semantic vectors?"

There are varied ways.


You could start with well-annotated data (that e.g. helps define similarity/dissimilarity), and the result might be of higher quality in the end.

But it is hard to come by annotation for many aspects of an entire language, and it's a lot of work to even try - and that's still ignoring details like contextual ambiguity, analyses even people wouldn't quite agree on, and the fact that you have to impose a particular system, so if it doesn't encode something you wanted, you have yet more work later to fix that.


A more recent trend is to put a little more trust in the assumptions of the distributional hypothesis, e.g. that words in similar contexts will be comparable.

This helps in that we can now use non-annotated data. We need a lot more of it for quality comparable to a smaller amount of annotated data, yes, but people have collectively produced a lot of text, on the internet and elsewhere.


Even this "use distributional hypothesis" angle isn't very specific, and has been was done in very varied ways over the years.

A short history might be interesting, but for now let's point out that a recent technique is word2vec, which doesn't do a lot more than look at what appears in similar contexts. (Its view is surprisingly narrow - but what happens in a small window, over a lot of data, will tend to be more consistent. A larger window is not only more work but often too fuzzy.) Its math is apparently fairly similar to classical matrix factorization.
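A minimal sketch of training it with gensim, assuming gensim 4.x is installed (the toy corpus is far too small to give meaningful vectors, it just shows the moving parts):

 from gensim.models import Word2Vec

 # each "sentence" is a list of tokens; a real corpus would have millions
 sentences = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "lay", "on", "the", "rug"]]

 model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

 vector = model.wv["cat"]                     # the learned 100-dimensional vector
 print(model.wv.most_similar("cat", topn=3))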


"Word embeddings" often refers to learning vectors from context, though there are more varied meanings (some conflicting), so you may wish to read 'embeddings' as 'text vectors' and figure out for yourself what the implementation actually is.


Static embeddings
Contextual word embeddings
Subword embeddings
The hashing trick (also, Bloom embeddings)
Now we have nicer numbers, but how do I use them?
vectors - unsorted

Moderately specific ideas and calculations

Collocations

Collocations are statistically idiosyncratic sequences.

The math that is often used to find these asks something like "do these adjacent words occur together more often than the individual occurrences of the separate words would suggest?".

...though ideally is a little more refined than that.
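A hand-rolled sketch of that sort of test, scoring adjacent word pairs by pointwise mutual information (real tools such as NLTK's collocation finders add frequency filters and better-behaved statistics like the log-likelihood ratio):

 import math
 from collections import Counter

 def pmi_bigrams(tokens, min_count=2):
     # score adjacent pairs by how much more often they co-occur
     # than their individual frequencies would suggest
     unigrams = Counter(tokens)
     bigrams = Counter(zip(tokens, tokens[1:]))
     total = len(tokens)
     scores = {}
     for (w1, w2), count in bigrams.items():
         if count < min_count:
             continue  # rare pairs give unreliable, inflated PMI
         p_pair = count / (total - 1)
         p_w1, p_w2 = unigrams[w1] / total, unigrams[w2] / total
         scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
     return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)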


This doesn't ascribe any meaning, or comparability; it just tends to signal anything from jargon, various fixed phrases, and empty habitual etiquette, to many other things that go beyond purely compositional construction - because why else, other than common sentence structure, would they co-occur so often?

...actually, there are varied takes on how useful collocations are, and why.

latent semantic analysis

Latent Semantic Analysis (LSA) is the application of Singular Value Decomposition to text analysis and search.


random indexing

https://en.wikipedia.org/wiki/Random_indexing


Topic modeling

Roughly, the idea is that given documents that are each about a particular topic, one would expect particular words to appear in each more or less frequently.

Assuming such documents share topics, you can probably find groups of words that belong to those topics.

Assuming each document is primarily about one topic, you can expect a larger set of documents to yield multiple topics, and an assignment of one or more of these topics to each document - so it acts like a soft/fuzzy clustering.

This is a relatively weak proposal in that it relies on a number of assumptions, but given that it requires zero training, it works better than you might expect when those assumptions are met (the largest probably being that your documents have singular topics).


https://en.wikipedia.org/wiki/Topic_model


word2vec

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

See *2vec#word2vec

GloVe

Coding stuff for training - one-hot coding and such