Difference between revisions of "Smoothing sparse data"

From Helpful
Jump to: navigation, search
(Created page with "{{math notes}} ==Feature model notes== ===Bag of words/features (assumption)=== The '''bag-of-words model''' {{comment|(more broadly '''bag-of-features model''')}} use single...")
Line 496: Line 496:
[[Category:Math on data]]

Revision as of 21:49, 28 June 2020

This is more for overview of my own than for teaching or exercise.

Overview of the areas

Arithmetic · 'elementary mathematics' and similar concepts
Set theory, Category theory
Geometry and its relatives · Topology
Elementary algebra - Linear algebra - Abstract algebra
Calculus and analysis
 : Information theory · Number theory · Decision theory, game theory · Recreational mathematics · Dynamical systems · Unsorted or hard to sort

Math on data:

  • Statistics as a field
some introduction · areas of statistics
types of data · on random variables, distributions
Virtues and shortcomings of...
on sampling · probability
glossary · references, unsorted
Footnotes on various analyses

Other data analysis, data summarization, learning

Regression · Classification, clustering, decisions · dimensionality reduction · Optimization theory, control theory
Connectionism, neural nets · Evolutionary computing

Feature model notes

Bag of words/features (assumption)

The bag-of-words model (more broadly bag-of-features model) use single words/features as an (unordered) set.

The implied assumption are that the features are independent, so order and adjacency is unimportant.

From the n-gram perspective, unigram models amount to this.

You can also see it as a voting system.

In text processing

One common example is Naive Bayes spam filtering , because its naivety essentially is this assumption that feature order does not matter.

Though real-world naive bayes spam filtering would take more complex features than single words (and may re-introduce order via n-grams or such), simple example versions may use 1-grams.

Other types of classifiers also make this assumption, or make it easy.

In image recognition / image search

The bag-of-features refers to models that recognize parts of a whole, and used in a number image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea that you can describe an image by the collection of small things we recognize in it, and that combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but often easily secondary.

See also:

Naive Bayes

Some background

A certain Reverend Thomas Bayes said a number of things about probability, used mostly in Bayesian inference in statistics, and in Naive Bayes classification.

Classification, in general, is usually based on a trained model:

  • deciding on a set of distinct target classes (say, exclusives as 'spam' vs. 'not spam', or rough article subjects (preferably exclusive, or classification itself does not make that much sense, although the value-for-class assignments still may)
  • deciding some features to look at, that classification will be based on
  • learning the importance of features to each class (often using good examples for each class)

Given something unseen to classify, you extract its features the same way you did for the training set, and see which class compares best. Classification is regularly many fuzzy measures of fit followed by a hard choice of which seems best.

Naive Bayes refers to one of the simpler algorithms that does classification this way.

Its definition has a few notational conventions:

  • Each training class is referred to as c in the set of classes C (c1...ci)
  • We eventually want to judge an unseen document D
  • and what we want for D is is the class for which the probability the document fits that class is highest

You can express such Bayesian classification as:

the c for which P(c|D) is highest 

or, a bit more formally yet,

c = maxarg( P(ci|D) )

Bayesian Inversion

The P(c|D) above is not obvious to calculate, because it immediately asks us for a best choice for a document, but a thorough answer will probably depend on knowledge the document's features, all classes, and all documents.

So we need to break the problem down before we can give an estimation of the probability P(c|D).

The first step is to rewrite using the concept of Bayesian inversion, based on Bayes's rule (which pops up in various other types of probability calculations and estimations).

             P(c) * P(D|c) 
P(c|D)  =  -----------------

...applied for each class c in C.

P(c) is the base probability of each class, that is, the probability that a document is in class c given no document information.

This is sometimes assumed to be equal for all classes (in which case it falls away),
sometimes it is estimated, for example using the assumption that the amount of documents for each class (in the training set) is strong indication of how commonly one of its items is seen.

P(D) is the chance of finding the document, in general.

This is usually assumed to be independent of the classification problem.
Because of that, and because it is hard to realistically estimate for an unseen document, it is often assumed to be equal for all documents (so falls away).

P(D|c) is the interesting one. It asks for the probability that a particular document in a particular class.

This is still abstract, and not yet a directly calculable answer, but it is a smaller problem than we had a few paragraphs up: Unlike the earlier form (P(c|D)), it deals not with an immediate choice between alternatives, but only with an calculation/estimation of fit, of 'how well does a specific Document fit in this specific class?' .
This degree of fit for a document in a class can be done with any method that judges similarity, as long as it is (fairly) stable/consistent between documents, and it can be calculated (roughly) equally well for seen and unseen documents.

Long story somewhat shorter, Bayesian inversion reduces the classification problem from

"Choose best class immediately"


"find the class for which our estimation of fit is highest" (by making an estimator that can judge individual documents, and can do so stably / comparably well for all documents)

or from:

maxarg( P(ci|D) )


maxarg( P(ci) * P(D|ci) )

or, assuming equal base-class probabilities, to:

maxarg( P(D|ci) )

Features, Words, and Naivety

So let's talk about an implementation.

If we want to calculate a class's fit based on features, we need to choose those features, and a method to calculate an arbitrary document's fit to those features.

It's nice if each feature is fairly robust in that it does not react in an unnecessarily noisy way, and productive for the classification task at hand.

In Naive Bayes introductions, the choice of features is often words (some introductions talk mainly of words, some talk more broadly about features), and the probabilities are based on the relative occurrence of each chosen word.

Words are a specific implementation choice, but are a nice introduction, in part because it speaks to the imagination in cases such as spam (even though it's limited for this, given that spammers know and try to defeat this model).

If you're programming such a toy example, then the amount of words is often reduced to maybe a thousand. The actual selection is interesting, in that they ought to be distinguishing within the training set ("the" may be common but not useful), but not be so unusual that they are rare within unseen documents.

There are cleverer ways of choosing words, and you probably want to consider the Zipf distribution of words, but to keep the example simple, let's say we take the most common 1000 in the training set, minus some obvious stopwords.

Naive Bayes is often explained by saying we want a probability based on n features in feature set F: P(c|f1..fn), which Bayes-inverses and denominator-eliminates to:

P(c) * P(f1..fn|c)

...which roughly means "how do features in this new document compare to the one for the class?"

Technically, P(f1..fn|c) should expand to a very long formula, as each feature depends on others. That's where the naivety of Naive Bayes comes in: the Naive Bayes assumption is simply that all features are independent of all others. This is rarely true, but is much simpler to work with, is evaluated much faster, and works better than you might expect. (this also effectively makes it a bag-of-words model)

It means calculation of the probability of the class based of the features simplifies to a simple multiplication of individual feature values / probabilities. Given there are n features:

P(c) * Πi=1 to n P(fi|c)

Note that the words-as-features setup is a #Bag_of_words.2Ffeatures_.28assumption.29 assumption: it plus the naivety means that word order is ignored. This means that potentially indicative phrases, collocations, and such are represented only weakly at best, by their words' independent probabilities being slightly higher. Word n-grams are one possible fix, but will likely run you into a sparseness problem since n-grams are implicitly rarer than single words. Even just bi-grams may be problematic - if you take all possible ones from a large set of documents, you'll find that most documents have a tiny fraction of the overall real use.

Bayesian classification as a procedure...

Naive Bayes training comes down to:

  • Deciding which set of features (F) you will use.
  • Train, which calculates characteristics for each class, namely:
    • P(c) for each class (see above), and more importantly,
    • all P(f|c): the probability of each feature f in each class c

Later, when classifying an unseen document, you calculate P(c) * P(f1..fn|c) for it:

  • for each c in C:
    • take P(c), times
    • the calculated probability of each feature in class c, i.e. each P(f|c) based on the document.
  • Choose the one with the highest probability, and you've got a classifier.

Choosing your features, and further discussion

Naive Bayes with simple word features works fairly well on various types of text, since it essentially notices differences in vocabulary use, the theory being that that is indicative of the document class. Common words can be useful when they aren't uniformly common between all classes, uncommon ones since they tend to up the probability of just a few classes.

You could simply use all words in all documents and end up with tens of thousands of features, although it would help speed to prune that a little. Which ones to leave is a choice you can be halfway smart about - although note that too much tweaking just means it will work better on the training data, with no guarantees for unseen data.

Also note that when you pre-process the training data, you need to process the unseen documents the same way. When choosing the features, you should consider what this may do to unseen documents.

A feature probability of zero is evil to calculations as it randomly destroys comparability in the both training and classification steps. It can happen fairly easily, such as when a word in the vocabulary doesn't occur in a document. A common solution is to cheat and pretend you did actually see it a little. There are a few different ways; some disturb the relative probabilities less than others.

You can add other features. The main property they should have is that their probabilities compare well, are in the same sort of scale. You could for example add important words a few times, a hackish way of weighing them more. You could for example add select word bigrams or add character n-grams.

You can alter the model further, observing that what you actually do is per-class strengthening or weakening of a probability that it will win. It doesn't even have to be a simple probability, even; you could for example include sentence length, which is not a direct probability, so you would have to e.g. remember the average sentence length in each class, calculate the average for the unseen document and define some metric of similarity. This isn't Naive Bayes anymore, but as a probability-based classifier it may work a little better if you choose your alterations well.


This example pseudocode uses only word features, and some basic add-one(verify) smoothing to avoid the zero problem.

Training (word features)

U = feature universe
D = document set (text,class pairs)
C = set of classes in D
foreach c in C:
  BaseProb(c) = |D with class c| / |D|
  Text = concatenation of documents in current class
  N    = |tokens in Text|
  foreach token t in U:
    P(t|c) = ( numoccurence(t in Text) + 1 ) / (numtokens(Text)+|U| )

Classifying (word features)

foreach c in C:
  foreach token in Document
    p = p*P(token|c)

...then collect each of these probabilities and choose the class with the highest probability.

In reality, the probabilities quickly become very small and you run into the problem that standard 32-bit or 64-bit floating point numbers cannot accurately contain them. A common workaround for this is to add the logs of the probabilities (yielding negative numbers) instead of multiplying the probabilities. These values are much less likely to scale out of control, and they are as just as monotonous as the probabilities.

N-gram notes

N-grams are contiguous sequence of length n.

They are most often seen in computational linguistics.

Applied to sequences of characters it can be useful e.g. in language identification, but the more common application is to words.

As n-grams models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful to things working off probabilities of combinations of words, for example for statistical parsing, collocation analysis, text classification, sliding window methods (e.g. sliding window POS tagger), (statistical) machine translation, and more

For example, for the already-tokenized input
the 2-grams would be:


An extension of n-grams that allow allowing a certain amount of arbitrarily positioned omissions from the input.

Skip-grams come from speech analysis, processing phonemes.

In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.

Continuous bag of words

Sparse data smoothing

This article is an attempt at a decent overview, introduction, summary, or such.

It may currently just contain a mess of information. If it is likely to keep sucking it will probably be removed later.

The problem

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(For relative brevity, the below omits details on training and validation, including terms like held-out, deleted interpolation. While important to execution, it's something you can look up when implementing)

Data sparsity is problematic for various statistical data models, such as simple maximum likelihood models, such as bag-of-words, n-gram, HMM, and others.

For example, in the set of all possible word trigrams (for language modeling or such), most have not happened in many years of text, and probably never will.

In genera, when you base probability (or such) fairly directly on appearance count, you can get a few problems:

  • components that occur very few times - relative probabilities make less sense on the rare end of the scale. Consider two items: one that appears once, one that appears three times. How fair is it to say that one is 300% as probable as the other, particularly when comparing two things that are almost never happen?
  • most of your trained data is spent on things you really don't care about
  • items that were not in your training data / trained model. It's hard to say what you want in such a case;
    • If you take a zero count to mean zero probability, you say it will never happen and it should drop out completely, which is a decision too harsh for most goals
    • ignoring components of an item can be that are not in training data may work well enough, except you can run into problems where the unseen items consists significantly of unknown components (or where the unknowns would in reality be the more telling ones, if looked at more specifically). You would get more variation in your estimations than you'ld want.
    • misspellings and different word order can sometimes make likelihood estimations take an unnecessary nosedive

When we're talking about likelihood, all ways of smoothing over these problems amount to changing the probability mass distribution somehow.

The most basic solutions are to fake the unknowns by pretending they exist but are simply rare. Some may try to be cleverer and boost some more than others.

The probability mass shift is also the reason many of the names have discounting in them: You take some probability from the likelier items to give them to rarer things. Do it too eagerly and you overestimate the rare and underestimate common things.

In some ways, assuming very low probability for unseen items (using only items we have halfway decent information on) has roughly the same effect and flaws as ignoring them, e.g. the nosedive mentioned. Smarter fallbacks of likelihood, even if more approximate, are often preferable.

For n-gram models

The larger the n in the n-gram, the larger the problem, because the more likely it is that you don't have something in your training data - you could say you need exponentially more data to get a decent estimation for larger n. For many types of data, n≥4 is already problematic. More data is good, but at some point a little fallback and smoothing is easier.

While not always mentioned along with smoothing, very low counts in count data are also a little weird. For example, if some n-gram appeard once and another twice, is it really twice as probable? (statistically, using very little information is problematic)

The smarter estimation (than assuming rareness) in such models often fall back onto a n-1 (one-piece-shorter) items.

Various solutions

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Additive smoothing (add-one, add-δ)

The simplest workaround to the zero-multiplication/zero-probability problem is add-one smoothing, a.k.a. Laplace smoothing, which adds one to every count (generally: to known_vocabulary + unknowns). In other words, pretend you saw everything a little more.

This is only really a fix to the zero-multiplication problem. The way it shifts the probability mass is not very natural - it gives a disproportional boost to rare things, and a lot of probability mass is taken away from unknowns when there a lot of unknowns.

It means underestimating commonly seen things somewhat, and overestimating unknowns (particularly relative to known but rare items). The effect depends on how much you add, and how high the counts are.

(In some methods this doesn't matter much because none of changes the order of the items, just the actual values. In some cases it will completely break things, though; it's a bad idea in n-gram models where you mix n-grams of different lengths)

Add-δ smoothing, a.k.a. Lidstone smoothing, is a variation that adds a smaller number, something between 0 and 1 and possibly based on the trained data. It mainly just plays with the magnitude of the distribution shift.

If the value is 0.5 this is also known as Expected Likelihood Estimation (ELE) and the Jeffreys-Perks Law.

All of these amount to saying that unknowns do exist and are very unlikely (and has less effect on well-known items), and handle all unknowns identically. If you can estimate more cleverly than that you probably should - and in many underlying models that's not very hard.

Various (parts of) methods of cleverness are somewhat portable between specific models.

Absolute discounting

Witten-Bell discounting

Intuitively: Unseen n-grams are just ones that are unseen until now. So give it an estimation (probability of seeing a new n-gram like this?(verify)).

(Can be considered an instance of Jelinek-Mercer smoothing(verify))

'Good-Turing smoothing / Good-Turing Discounting

Redistributes the probability mass. Assuming binomial distribution (though leaves better known things alone).

Its most basic definition is more of an idea, and not robust by itself, but a good base for some more advanced smoothers (involving backoff/interpolation)

See also:

Church-Gale smoothing

Good-Turing plus bucketing.

Kneser-Ney Smoothing

Probably the best available with the smoothing category, in that it uses few-count items cleverly effectively.

Similar to Witten-Bell in that it considers context, and can be considered an extension to absolute discounting

Works better with interpolation than with backoff

Modified Kneser-Ney Smoothing (introduced in Chen & Goodman) is an variation that seems to fairly consistently perform somewhat better.

Combined estimators

These aren't really just smoothing anymore, including many the variations of interpolation and backoff - which uses information on shorter items to estimate longer items with unseen parts. Intuitively: that data is necessarily not as sparse, and certainly a better estimate than additive smoothing.

For example, if both "john likes" and "john spork" are unknown, then you can estimate that 'john likes' is more likely by using the unigram 'likes' and 'spork'. If the trigram john likes mnemonics is unseen, you may find you have the bigram bigram john likes. While not actually what you asked, it's still a much better estimation than additive smoothing gives you, particularly since that would happen a lot for larger n.

Backoff: if there is no information on a given n-gram, use (n-1)-grams within it.

Interpolation: Estimate probability of every n-gram is estimated using (n-1)-grams - which is a recursive definition. It's also a relatively general idea that can be and is tweaked in many ways.

These are sensible ways of using larger n-grams without necessarily causing a huge shift in probability mass.

It's not very hard to alter a backoff algorithm into an interpolating one, and vise versa, though most algorithms are primarily seen as just one.

(Simple) Linear interpolation is a weighed combination of basic maximum likelihood and use of lower-order n-grams (a recursive definition).

We want to weigh each n-gram based on training data - and we want to train that from a data set distinct from that used for ML training. The common (named) choices:

Even with deleted interpolation, you won't easily get enough data.

Jelinek-Mercer smoothing - interpolating method that tries to get more coverage using further fuzziness, namely bucketing (), based on counts. Which can help even fairly small training sets perform decently.

There are variations that use different bucketing criteria

Apparently not used much anymore; there are simpler options that

Katz smoothing, a.k.a. Katz backoff (n-gram based) - Extends Good-Turing redistributing counts but using different-order models (can be said to be backoff). Often introduced for bigrams and unigrams, but can be taken recursively.

Reverts to use shorter n-gram context if count for the n-gram is lower than a given threshold.

Things that need to reserve data (Jelinek-Mercer, Witten-Bell, ) are not used so much anymore, because there are methods that don't require it (e.g. Kneser-Ney), which tend to work better.


See also:


Class-based smoothing

Absolute discounting

Held-out estimation

See also

  • Chen & Goodman, An Empirical Study of Smoothing Techniques for Language Modeling [1]
    • ...its references