Smoothing sparse data

This is more an overview for my own use than for teaching or exercise.

Feature models and smoothing

Sparse data smoothing

...in particular in feature models


The problem

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(For relative brevity, the below omits details on training and validation, including terms like held-out, deleted interpolation. While important to execution, it's something you can look up when implementing)


Data sparsity is problematic for various statistical data models, particularly fairly direct maximum likelihood models such as bag-of-words, n-gram, and HMM models.


For example, of the set of all possible word trigrams (for language modeling or such), most have never occurred in many years' worth of text, and probably never will.

In general, when you base probability (or such) fairly directly on appearance counts, you can get a few problems:

  • components that occur very few times - relative probabilities make less sense at the rare end of the scale. Consider two items: one that appears once, one that appears three times. How fair is it to say that one is 300% as probable as the other, particularly when comparing two things that almost never happen?
  • most of your training data is spent on things you really don't care about
  • items that were not in your training data / trained model. It's hard to say what you want in such a case;
    • If you take a zero count to mean zero probability, you say it will never happen and it should drop out completely, which is a decision too harsh for most goals (see the sketch after this list)
    • ignoring components of an item that are not in the training data may work well enough, except you can run into problems where the unseen items consist significantly of unknown components (or where the unknowns would in reality be the more telling ones, if looked at more specifically). You would get more variation in your estimations than you'd want.
    • misspellings and different word order can sometimes make likelihood estimations take an unnecessary nosedive
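
As a minimal sketch of that zero-probability harshness (Python, with a made-up toy corpus): a sentence scored as a product of per-word maximum-likelihood probabilities collapses to zero the moment one word is unseen, regardless of how plausible the rest is.

```python
from collections import Counter

# Made-up toy training corpus, just to have counts to work with.
train = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(train)
total = sum(counts.values())

def ml_prob(word):
    # Maximum-likelihood unigram probability; unseen words get exactly 0.
    return counts[word] / total

for sentence in ["the cat sat on the rug", "the cat sat on the sofa"]:
    p = 1.0
    for w in sentence.split():
        p *= ml_prob(w)
    print(f"{sentence!r}: {p:.3e}")

# The second sentence scores 0.0 purely because 'sofa' was never seen -
# the over-harsh drop-out described above.
```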


When we're talking about likelihood, all ways of smoothing over these problems amount to changing the probability mass distribution somehow.

The most basic solutions fake the unknowns by pretending they exist but are simply rare. Some try to be cleverer, boosting some items more than others.

The probability mass shift is also the reason many of the names have 'discounting' in them: you take some probability from the likelier items and give it to the rarer things. Do it too eagerly and you overestimate the rare things and underestimate the common ones.

In some ways, assuming a very low probability for unseen items (using only the items we have halfway decent information on) has roughly the same effect and flaws as ignoring them, e.g. the nosedive mentioned above. Smarter likelihood fallbacks, even if more approximate, are often preferable.


For n-gram models

The larger the n in the n-gram, the larger the problem, because the more likely it is that you don't have something in your training data - you could say you need exponentially more data to get a decent estimation for larger n. For many types of data, n≥4 is already problematic. More data is good, but at some point a little fallback and smoothing is easier.

While not always mentioned along with smoothing, very low counts in count data are also a little weird. For example, if some n-gram appeared once and another twice, is it really twice as probable? (Statistically, using very little information is problematic.)

Smarter estimation (than assuming rareness) in such models often falls back onto the (n-1)-gram (one-piece-shorter) items.
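
To see the sparsity grow with n, a quick sketch in Python (the sample text is just a placeholder; substitute any real corpus): as n goes up, the fraction of distinct n-grams seen only once heads towards 1, i.e. most of your counts become the unreliable kind just mentioned.

```python
from collections import Counter

# Placeholder text - swap in a real corpus to see the effect properly.
text = ("the quick brown fox jumps over the lazy dog and then "
        "the quick brown cat jumps over the lazy fox again").split()

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for n in range(1, 5):
    counts = Counter(ngrams(text, n))
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, "
          f"{singletons / len(counts):.0%} seen only once")
```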

Various solutions

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Additive smoothing (add-one, add-δ)

The simplest workaround to the zero-multiplication/zero-probability problem is add-one smoothing, a.k.a. Laplace smoothing, which adds one to every count (generally: to every item in the known vocabulary plus the unknowns). In other words, pretend you saw everything a little more often than you did.

This is only really a fix to the zero-multiplication problem. The way it shifts the probability mass is not very natural - it gives a disproportionate boost to rare things, and a lot of probability mass is taken away from seen items when there are a lot of unknowns.

It means underestimating commonly seen things somewhat, and overestimating unknowns (particularly relative to known but rare items). The effect depends on how much you add, and how high the counts are.

(In some methods this doesn't matter much, because none of this changes the order of the items, just the actual values. In some cases it will completely break things, though; it's a bad idea in n-gram models where you mix n-grams of different lengths)


Add-δ smoothing, a.k.a. Lidstone smoothing, is a variation that adds a smaller number, something between 0 and 1 and possibly based on the training data. It mainly just plays with the magnitude of the distribution shift.

If the value is 0.5 this is also known as Expected Likelihood Estimation (ELE) and the Jeffreys-Perks Law.
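
As a minimal sketch of add-one / add-δ in Python (the unigram counts and the single catch-all unknown slot are made up): every vocabulary item gets the same boost, which is exactly the blunt redistribution described above.

```python
from collections import Counter

# Made-up unigram counts for illustration.
counts = Counter({"the": 120, "cat": 3, "sporks": 1})
vocab = set(counts) | {"<unk>"}   # pretend a single extra slot for all unknowns
total = sum(counts.values())

def additive_prob(word, delta=1.0):
    # delta=1.0 is add-one / Laplace; 0 < delta < 1 is Lidstone,
    # with delta=0.5 being the ELE / Jeffreys-Perks choice mentioned above.
    word = word if word in vocab else "<unk>"
    return (counts[word] + delta) / (total + delta * len(vocab))

for delta in (1.0, 0.5, 0.1):
    print(f"delta={delta}: "
          f"P(the)={additive_prob('the', delta):.3f}  "
          f"P(sporks)={additive_prob('sporks', delta):.4f}  "
          f"P(zucchini)={additive_prob('zucchini', delta):.4f}")
```

Note how everything handed to 'sporks' and the unknown slot comes out of 'the'; with a realistically large vocabulary and many unknowns, that transfer gets much larger.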


All of these amount to saying that unknowns do exist and are very unlikely (and have less effect on well-known items), and they handle all unknowns identically. If you can estimate more cleverly than that, you probably should - and in many underlying models that's not very hard.

Various (parts of) methods of cleverness are somewhat portable between specific models.


Absolute discounting

Witten-Bell discounting

Intuitively: unseen n-grams are just ones that are unseen so far. So give them an estimate (the probability of seeing a new n-gram like this?(verify)).

(Can be considered an instance of Jelinek-Mercer smoothing(verify))
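
A sketch of the Witten-Bell idea for a single context in Python (the bigram counts and the assumed number of unseen types are made up): the mass reserved for not-yet-seen continuations is estimated from how often this context has produced new word types, T/(N+T), where T is the number of distinct continuations seen and N the number of tokens seen after the context.

```python
from collections import Counter

# Made-up counts of words seen after the context "the".
follow = Counter({"cat": 6, "dog": 3, "mat": 1})
N = sum(follow.values())   # tokens seen after this context
T = len(follow)            # distinct word types seen after this context

def witten_bell(word, unseen_types=10_000):
    if word in follow:
        return follow[word] / (N + T)
    # T/(N+T) is the total mass for unseen continuations; here it is simply
    # split evenly over an assumed number of unseen types - in practice you
    # would spread it according to a lower-order model instead.
    return T / ((N + T) * unseen_types)

print(witten_bell("cat"), witten_bell("zebra"))
```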


Good-Turing smoothing / Good-Turing Discounting

Redistributes the probability mass. Assumes a binomial distribution (though it leaves better-known things alone).

Its most basic definition is more of an idea, and not robust by itself, but a good base for some more advanced smoothers (involving backoff/interpolation)
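
A bare-bones sketch of the count adjustment in Python (made-up counts): counts are re-estimated as c* = (c+1)·N(c+1)/N(c), where N(c) is how many distinct items were seen exactly c times, and the total mass left for unseen items comes out to N(1)/N. The raw form breaks as soon as some N(c+1) is zero - the 'not robust by itself' part - so real implementations smooth the N(c) values (e.g. Simple Good-Turing) or only adjust low counts.

```python
from collections import Counter

# Made-up item counts (think n-gram frequencies).
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3, "g": 5}
N = sum(counts.values())

# N(c): how many distinct items were seen exactly c times.
freq_of_freq = Counter(counts.values())

def good_turing_count(c):
    # Raw Good-Turing adjusted count; left unadjusted when N(c+1) == 0,
    # which is exactly why freq_of_freq needs smoothing in practice.
    if freq_of_freq.get(c + 1, 0) == 0:
        return c
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print("mass reserved for unseen items:", freq_of_freq[1] / N)
for item, c in counts.items():
    print(item, c, "->", round(good_turing_count(c), 3))
```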

See also:

Church-Gale smoothing

Good-Turing plus bucketing.





Kneser-Ney Smoothing

Probably the best available within the smoothing category, in that it uses low-count items cleverly and effectively.

Similar to Witten-Bell in that it considers context, and it can be considered an extension of absolute discounting

Works better with interpolation than with backoff
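
A sketch of interpolated Kneser-Ney for bigrams in Python (tiny made-up corpus, and a fixed discount D where the real thing estimates it from counts; Modified Kneser-Ney uses several discounts): the distinctive part is the continuation probability, which asks in how many distinct contexts a word appears rather than how often it appears.

```python
from collections import Counter

# Tiny made-up corpus.
tokens = "the cat sat on the mat the dog sat on the rug".split()
bigrams = Counter(zip(tokens, tokens[1:]))

D = 0.75  # fixed absolute discount; normally estimated from the count data

# Per-context totals and number of distinct continuations.
context_totals, context_types = Counter(), Counter()
for (u, w), c in bigrams.items():
    context_totals[u] += c
    context_types[u] += 1

# Continuation counts: in how many distinct contexts does w appear?
continuations = Counter(w for (_, w) in bigrams)
total_bigram_types = len(bigrams)

def p_continuation(w):
    return continuations[w] / total_bigram_types

def p_kneser_ney(w, prev):
    c_prev = context_totals[prev]
    if c_prev == 0:
        return p_continuation(w)      # unseen context: pure continuation prob
    discounted = max(bigrams[(prev, w)] - D, 0) / c_prev
    lam = D * context_types[prev] / c_prev   # leftover mass for this context
    return discounted + lam * p_continuation(w)

print(p_kneser_ney("mat", "the"), p_kneser_ney("dog", "on"))
```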


Modified Kneser-Ney smoothing (introduced in Chen & Goodman) is a variation that seems to fairly consistently perform somewhat better.

Combined estimators

These aren't really just smoothing anymore, and include the many variations of interpolation and backoff - which use information on shorter items to estimate longer items with unseen parts. Intuitively: that data is necessarily not as sparse, and certainly gives a better estimate than additive smoothing.


For example, if both "john likes" and "john spork" are unknown, you can still estimate that 'john likes' is more likely by using the unigrams 'likes' and 'spork'. If the trigram 'john likes mnemonics' is unseen, you may find you have the bigram 'john likes'. While not actually what you asked, it's still a much better estimation than additive smoothing gives you, particularly since this happens a lot for larger n.


Backoff: if there is no information on a given n-gram, use (n-1)-grams within it.

Interpolation: the probability of every n-gram is estimated using its (n-1)-grams as well - which is a recursive definition. It's also a relatively general idea that can be and is tweaked in many ways.

These are sensible ways of using larger n-grams without necessarily causing a huge shift in probability mass.

It's not very hard to alter a backoff algorithm into an interpolating one, and vice versa, though most algorithms are primarily seen as just one.


(Simple) Linear interpolation is a weighted combination of the basic maximum likelihood estimate and lower-order n-gram estimates (a recursive definition).
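
A sketch of simple linear interpolation for trigrams in Python, with made-up fixed weights - in practice the λ's are trained on held-out data, as discussed below. (Words unseen even at the unigram level would still need additive smoothing or an unknown-word slot; interpolation is about unseen longer contexts.)

```python
from collections import Counter

# Tiny made-up corpus, just to have counts to work with.
tokens = "the cat sat on the mat the cat sat on the rug".split()
uni = Counter(tokens)
bi  = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N   = sum(uni.values())

# Made-up interpolation weights (must sum to 1); normally trained,
# e.g. by deleted interpolation on held-out data.
L3, L2, L1 = 0.6, 0.3, 0.1

def p_interp(w, u, v):
    # w is the current word, (u, v) the two preceding words.
    ml3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    ml2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    ml1 = uni[w] / N
    return L3 * ml3 + L2 * ml2 + L1 * ml1

print(p_interp("mat", "on", "the"),   # seen trigram 'on the mat'
      p_interp("rug", "mat", "the"))  # unseen trigram, but the bigram 'the rug' helps
```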


We want to weight each n-gram based on training data - and we want to train those weights on a data set distinct from the one used for ML training. The common (named) choices:


Even with deleted interpolation, you won't easily get enough data.


Jelinek-Mercer smoothing - an interpolating method that tries to get more coverage using further fuzziness, namely bucketing based on counts. This can help even fairly small training sets perform decently.

There are variations that use different bucketing criteria
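
A sketch of the bucketing idea in Python (the bucket boundaries and λ values are made up; in Jelinek-Mercer they come out of EM on held-out data): the interpolation weight depends on how much evidence the history has, so well-seen histories trust their own ML estimate more.

```python
import math
from collections import Counter

# Tiny made-up corpus.
tokens = "the cat sat on the mat the cat sat on the rug".split()
uni = Counter(tokens)
bi  = Counter(zip(tokens, tokens[1:]))
N   = sum(uni.values())

# Made-up lambdas per bucket of history count (log-scale buckets);
# these would normally be trained with EM on held-out data.
bucket_lambdas = {0: 0.1, 1: 0.4, 2: 0.6, 3: 0.8}

def bucket(history_count):
    return min(int(math.log2(history_count + 1)), max(bucket_lambdas))

def p_jm(w, prev):
    lam = bucket_lambdas[bucket(uni[prev])]
    ml_bigram = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    ml_unigram = uni[w] / N
    return lam * ml_bigram + (1 - lam) * ml_unigram

print(p_jm("cat", "the"), p_jm("rug", "sat"))
```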


Apparently not used much anymore; there are simpler options that tend to work as well or better.


Katz smoothing, a.k.a. Katz backoff (n-gram based) - extends Good-Turing by redistributing counts using different-order models (so it can be said to be a backoff method). Often introduced for bigrams and unigrams, but can be applied recursively.

Reverts to the shorter n-gram context if the count for the n-gram is lower than a given threshold.
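
A much-simplified sketch of the backoff pattern in Python for bigrams - a flat discount stands in for the Good-Turing discount ratios proper Katz backoff uses, and the threshold is applied crudely; the α normaliser is what keeps the result a proper distribution.

```python
from collections import Counter

# Tiny made-up corpus.
tokens = "the cat sat on the mat the cat sat on the rug".split()
uni = Counter(tokens)
bi  = Counter(zip(tokens, tokens[1:]))
N   = sum(uni.values())
D   = 0.5   # flat discount standing in for Good-Turing discounting
K   = 0     # count threshold; Katz keeps counts above some k essentially as-is

def p_unigram(w):
    return uni[w] / N

def p_backoff(w, prev):
    if bi[(prev, w)] > K:
        # discounted ML estimate for sufficiently-seen bigrams
        return (bi[(prev, w)] - D) / uni[prev]
    # alpha: the mass freed up by discounting this history's bigrams,
    # renormalised over the unigram mass of the words we back off to
    seen = {v for (u, v) in bi if u == prev}
    freed = D * len(seen) / uni[prev] if uni[prev] else 1.0
    unseen_mass = 1.0 - sum(p_unigram(v) for v in seen)
    alpha = freed / unseen_mass if unseen_mass > 0 else 0.0
    return alpha * p_unigram(w)

print(p_backoff("cat", "the"),   # seen bigram, discounted ML
      p_backoff("sat", "the"))   # unseen bigram, backed off to the unigram
```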



Methods that need to reserve data (Jelinek-Mercer, Witten-Bell) are not used so much anymore, because there are methods that don't require it (e.g. Kneser-Ney), and those tend to work better.





Unsorted

Class-based smoothing

Absolute discounting

Held-out estimation

See also

  • Chen & Goodman, An Empirical Study of Smoothing Techniques for Language Modeling [1]
    • ...its references