Data modeling, restructuring, analysis, fuzzy cases, learning
 
====Deep?====

<!--
Deep learning is basically any technique (but largely NNs) that can learn unsupervised from unstructured/unlabeled data.

While it may initially learn slower than something would from well-constructed data,
one of the ideas is that you can throw much larger amounts of data at it and it keeps learning.

This also implies it's a feature learner.


Deep Boltzmann machine
https://en.wikipedia.org/wiki/Boltzmann_machine#Deep_Boltzmann_machine


* Deep Belief Network - feed-forward stack

-->
=Semi-sorted=


==Feature model notes==

===Bag of words/features (assumption)===

The '''bag-of-words model''' {{comment|(more broadly '''bag-of-features model''')}} uses single words/features as an (unordered) set.

The implied assumption is that the features are independent, so order and adjacency are unimportant.

From the n-gram perspective, unigram models amount to this.

You can also see it as a voting system.
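
As a minimal sketch of the idea (Python; assumes that lowercasing and whitespace splitting is an acceptable tokenizer for illustration):

 from collections import Counter
 def bag_of_words(text):
     # extremely naive tokenization: lowercase, split on whitespace
     # (real use would want smarter tokenization and normalization)
     return Counter(text.lower().split())
 print(bag_of_words("the cat sat on the mat"))
 # -> counts like {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}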


====In text processing====

One common example is [[#Naive Bayes|Naive Bayes]] spam filtering, because its naivety essentially ''is'' this assumption that feature order does not matter.

Though real-world Naive Bayes spam filtering would take more complex features than single words (and may re-introduce order via n-grams or such), simple example versions may use 1-grams.

Other types of classifiers also make this assumption, or at least make it easy to apply.


====In image recognition / image search====

Bag-of-features refers to models that recognize parts of a whole, and is used in a number of image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea is that you can describe an image by the collection of small things we recognize in it, and that their combined presence is typically already a strong indicator (particularly when you add some hypothesis testing).
Exact placement can be useful, but is often secondary.

See also:
* http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html
* http://labelme.csail.mit.edu/

===Naive Bayes===

====Some background====

A certain Reverend [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] said a number of things about probability,
used mostly in [[Bayesian inference]] in statistics, and in Naive Bayes classification.


'''[[Classification]]''', in general, is usually based on a trained model:
* deciding on a set of distinct target classes (say, exclusive classes such as 'spam' vs. 'not spam', or rough article subjects) {{comment|(preferably exclusive, or classification itself does not make that much sense, although the value-for-class assignments still may)}}
* deciding some features to look at, that classification will be based on
* learning the importance of features to each class (often using good examples for each class)

Given something unseen to classify, you extract its features the same way you did for the training set, and see which class compares best. Classification is regularly many fuzzy measures of fit followed by a hard choice of which seems best.


'''Naive Bayes''' refers to one of the simpler algorithms that does classification this way.

Its definition has a few notational conventions:
* Each training class is referred to as '''<tt>c</tt>''' in the set of classes '''<tt>C</tt>''' {{comment|(<tt>c<sub>1</sub></tt>...<tt>c<sub>i</sub></tt>)}}
* We eventually want to judge an unseen document '''<tt>D</tt>'''
* and what we want for '''<tt>D</tt>''' is the class for which the probability that the document fits that class is highest


You can express such Bayesian classification as:
 the c for which <tt>P(c|D)</tt> is highest

or, a bit more formally,
 c = maxarg( P(c<sub>i</sub>|D) )

====Bayesian Inversion====

The <tt>P(c|D)</tt> above is not obvious to calculate, because it immediately asks us for a best choice for a document, but a thorough answer will probably depend on knowledge of the document's features, all classes, and all documents.

So we need to break the problem down before we can give an estimation of the probability <tt>P(c|D)</tt>.

The first step is to rewrite using the concept of Bayesian inversion, based on '''Bayes's rule''' {{comment|(which pops up in various other types of probability calculations and estimations)}}.

              P(c) * P(D|c)
 P(c|D)  =  -----------------
                  P(D)

...applied for each class <tt>c</tt> in <tt>C</tt>.


'''P(c)''' is the base probability of each class, that is, the probability that a document is in class ''c'' given ''no'' document information.
: This is sometimes assumed to be equal for all classes {{comment|(in which case it falls away)}},
: sometimes it is estimated, for example using the assumption that the amount of documents for each class (in the training set) is a strong indication of how commonly one of its items is seen.

'''P(D)''' is the chance of finding the document, in general.
: This is usually assumed to be independent of the classification problem.
: Because of that, and because it is hard to realistically estimate for an unseen document, it is often assumed to be equal for all documents (so falls away).

'''P(D|c)''' is the interesting one. It asks for the probability of seeing a particular document given a particular class.
: This is still abstract, and not yet a directly calculable answer, but it is a smaller problem than we had a few paragraphs up: unlike the earlier form {{comment|('''P(c&#x7c;D)''')}}, it deals not with an immediate choice between alternatives, but only with a calculation/estimation of fit, of '' 'how well does a specific '''D'''ocument fit in this specific '''c'''lass?' ''.

: Estimating this degree of fit for a document in a class can be done with any method that judges similarity, as long as '''it is (fairly) stable/consistent''' between documents, and it '''can be calculated (roughly) equally well for seen and unseen documents'''.


Long story somewhat shorter,
Bayesian inversion reduces the classification problem from
: "choose the best class immediately"
to
: "find the class for which our estimation of fit is highest" {{comment|(by making an estimator that can judge individual documents, and can do so stably / comparably well for all documents)}}

or from:
 maxarg( P(c<sub>i</sub>|D) )
to:
 maxarg( P(c<sub>i</sub>) * P(D|c<sub>i</sub>) )
or, assuming equal base-class probabilities, to:
 maxarg( P(D|c<sub>i</sub>) )

====Features, Words, and Naivety====

So let's talk about an implementation.

If we want to calculate a class's fit based on features, we need
to choose those features, and a method to calculate an arbitrary document's fit to those features.

It's nice if each feature is
''fairly robust'', in that it does not react in an unnecessarily noisy way,
and ''productive'' for the classification task at hand.


In Naive Bayes introductions, the choice of features is often words {{comment|(some introductions talk mainly of words, some talk more broadly about features)}}, and the probabilities are based on the relative occurrence of each chosen word.

Words are a specific implementation choice, but are a nice introduction, in part because they speak to the imagination in cases such as spam {{comment|(even though it's limited for this, given that spammers know and try to defeat this model)}}.


If you're programming such a toy example, then the amount of words is often reduced to maybe a thousand.
The actual selection is interesting, in that the words ought to be distinguishing within the training set ("the" may be common but not useful), but not be so unusual that they are rare within unseen documents.

There are cleverer ways of choosing words, and you probably want to consider the [http://en.wikipedia.org/wiki/Zipf_distribution Zipf distribution] of words, but to keep the example simple, let's say we take the most common 1000 in the training set, minus some obvious stopwords.

<!--
In fact, spam filters may be bayesian but will have more specific features, such as 'apparently sent to self', 'sent to dozens of people', 'uses many non-dictionary words', some estimation of 'uses misspellings likely to be read over but avoid exact matchers', presence of common spam-phrases (e.g. based on regexp matches), trust or distrust of certain mail networks, etc.
-->


Naive Bayes is often explained by saying we want a probability based on <tt>n</tt> features in feature set F: <tt>P(c|f<sub>1</sub>..f<sub>n</sub>)</tt>, which Bayes-inverses and denominator-eliminates to:
 P(c) * P(f<sub>1</sub>..f<sub>n</sub>|c)
...which roughly means "how do the features in this new document compare to those for the class?"


Technically, <tt>P(f<sub>1</sub>..f<sub>n</sub>|c)</tt> should expand to a very long formula, as each feature depends on others. That's where the naivety of Naive Bayes comes in: the '''Naive Bayes assumption''' is simply that all features are independent of all others. This is rarely true, but is much simpler to work with, is evaluated much faster, and works better than you might expect. (This also effectively makes it a [[bag-of-words]] model.)

It means calculation of the probability of the class based on the features simplifies to a simple multiplication of individual feature values / probabilities. Given there are n features:
 P(c) * <span style="font-size:210%">&#928;</span><sub>i=1 to n</sub> P(f<sub>i</sub>|c)


Note that the words-as-features setup is a [[#Bag_of_words.2Ffeatures_.28assumption.29|bag-of-words]] assumption: it plus the naivety means that word order is ignored. This means that potentially indicative phrases, collocations, and such are represented only weakly at best, by their words' independent probabilities being slightly higher. Word n-grams are one possible fix, but will likely run you into a sparseness problem, since n-grams are implicitly rarer than single words.
Even just bigrams may be problematic - if you take all possible ones from a large set of documents, you'll find that each document contains only a tiny fraction of those in overall real use.

====Bayesian classification as a procedure...====

Naive Bayes training comes down to:
* Deciding which set of features (F) you will use.
* '''Train''', which calculates characteristics for each class, namely:
** P(c) for each class (see above), and more importantly,
** all P(f|c): the probability of each feature f in each class c


Later, when classifying an unseen document, you calculate <tt>P(c) * P(f<sub>1</sub>..f<sub>n</sub>|c)</tt> for it:
* for each c in C:
** take P(c), times
** the calculated probability of each of the document's features in class c, i.e. each P(f|c)
* Choose the class with the highest probability, and you've got a classifier.

====Choosing your features, and further discussion====

Naive Bayes with simple word features works fairly well on various types of text, since it essentially notices differences in vocabulary use, the theory being that that is indicative of the document class. Common words can be useful when they aren't uniformly common between all classes, uncommon ones since they tend to up the probability of just a few classes.


You could simply use all words in all documents and end up with tens of thousands of features, although it would help speed to prune that a little. Which ones to leave is a choice you can be halfway smart about - although note that too much tweaking just means it will work better on the training data, with no guarantees for ''unseen'' data.

Also note that when you pre-process the training data, you need to process the unseen documents the same way. When choosing the features, you should consider what this may do to unseen documents.


A feature probability of zero is evil to calculations, as it randomly destroys comparability in both the training and classification steps. It can happen fairly easily, such as when a word in the vocabulary doesn't occur in a document. A common solution is to cheat and pretend you did actually see it a little. There are a few different ways; some disturb the relative probabilities less than others.


You can add other features. The main property they should have is that their probabilities compare well, i.e. are on the same sort of scale. You could for example add important words a few times, a hackish way of weighing them more, or add select word bigrams or character n-grams.

You can alter the model further, observing that what you actually do is per-class strengthening or weakening of the probability that it will win.
It doesn't even have to be a simple probability; you could for example include sentence length, which is not a direct probability, so you would have to e.g. remember the average sentence length in each class, calculate the average for the unseen document, and define some metric of similarity. This isn't Naive Bayes anymore, but as a probability-based classifier it may work a little better if you choose your alterations well.

====Pseudocode====

This example pseudocode uses only word features, and some basic add-one{{verify}} smoothing to avoid the zero problem.

'''Training''' (word features)
 U = feature universe
 D = document set (text,class pairs)
 C = set of classes in D
 foreach c in C:
   P(c) = |D with class c| / |D|
   Text = concatenation of documents in current class
   N    = |tokens in Text|
   foreach token t in U:
     P(t|c) = ( numoccurrences(t in Text) + 1 ) / ( N + |U| )


'''Classifying''' (word features)
 foreach c in C:
   p = P(c)
   foreach token in Document:
     p = p * P(token|c)
...then collect each of these probabilities and choose the class with the highest probability.


In reality, the probabilities quickly become very small and you run into the problem that standard 32-bit or 64-bit floating point numbers cannot accurately contain them.
A common workaround for this is to add the logs of the probabilities (yielding negative numbers) instead of multiplying the probabilities. These values are much less likely to scale out of control, and they are just as monotonic as the probabilities.
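
A minimal runnable sketch of the above in Python - not the canonical implementation, just one way to do it: the tokenizer, function names, and toy data are made up for illustration, the vocabulary is simply every token seen in training, and it works in log space as just described:

 import math
 from collections import Counter
 def train(docs):
     # docs: list of (text, class_label) pairs
     vocab = set()
     class_tokens = {}
     class_doccount = Counter()
     for text, c in docs:
         tokens = text.lower().split()          # naive tokenization
         vocab.update(tokens)
         class_tokens.setdefault(c, []).extend(tokens)
         class_doccount[c] += 1
     total_docs = sum(class_doccount.values())
     model = {}
     for c, tokens in class_tokens.items():
         counts = Counter(tokens)
         n = len(tokens)
         logprior = math.log(class_doccount[c] / total_docs)
         # add-one smoothing over the vocabulary, stored as log probabilities
         loglik = {t: math.log((counts[t] + 1) / (n + len(vocab))) for t in vocab}
         fallback = math.log(1 / (n + len(vocab)))   # for tokens never seen in training at all
         model[c] = (logprior, loglik, fallback)
     return model
 def classify(model, text):
     tokens = text.lower().split()
     def score(c):
         logprior, loglik, fallback = model[c]
         return logprior + sum(loglik.get(t, fallback) for t in tokens)
     return max(model, key=score)
 model = train([("cheap pills buy now", "spam"),
                ("meeting agenda attached", "ham"),
                ("buy cheap meds now", "spam"),
                ("lunch meeting tomorrow", "ham")])
 print(classify(model, "buy pills now"))     # most likely 'spam'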
==N-gram notes==

N-grams are contiguous sequences of n items.

They are most often seen in computational linguistics.

Applied to sequences of characters they can be useful e.g. in language identification,
but the more common application is to words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful
for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding-window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input {{inlinecode|This}} {{inlinecode|is}} {{inlinecode|a}} {{inlinecode|sentence}} {{inlinecode|.}} the 2-grams would be:
: {{inlinecode|This}},{{inlinecode|is}}
: {{inlinecode|is}},{{inlinecode|a}}
: {{inlinecode|a}},{{inlinecode|sentence}}
: {{inlinecode|sentence}},{{inlinecode|.}}
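
A minimal sketch of extracting such n-grams from an already-tokenized input (plain Python, no particular library assumed):

 def ngrams(tokens, n):
     # sliding window over the token list; returns a list of tuples
     return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
 print(ngrams(['This', 'is', 'a', 'sentence', '.'], 2))
 # [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]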

<!--
N-grams for linguistic analysis regularly have the problem that if you look at ''all'' n-grams for a text, most of them appear just once or twice, which analyses don't always deal with well, and which also requires a lot of memory for mostly-pointless stuff.

In practice, n-gram models often use bigrams and trigrams; unigrams often give little information, and 4-grams and larger give relatively limited yield for the additional space and/or calculation they require, though this varies somewhat with implementation (type) and the application.
-->

===Skip-grams===

An extension of n-grams allowing a certain amount of arbitrarily positioned omissions from the input.

Skip-grams come from speech analysis, processing phonemes.

In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.
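
A minimal sketch for the bigram case (plain Python; the function name is just for illustration, and 'up to k skipped tokens between the two members' is the reading of skipping used here, with k=0 giving plain bigrams):

 def skip_bigrams(tokens, k):
     # pairs of tokens with at most k tokens skipped between them
     out = []
     for i in range(len(tokens)):
         for j in range(i + 1, min(i + 2 + k, len(tokens))):
             out.append((tokens[i], tokens[j]))
     return out
 print(skip_bigrams(['This', 'is', 'a', 'sentence'], 1))
 # [('This', 'is'), ('This', 'a'), ('is', 'a'), ('is', 'sentence'), ('a', 'sentence')]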

<!--
Note that the amount of output tuples is a factor more than for non-skipping n-grams, a bit more than proportional to the amount of allowed skips.

For example, for <tt>this is a sentence</tt>, one of the outputs would be
{{inlinecode|This}},{{inlinecode|is}},{{inlinecode|sentence}}

Slightly more formally, a k-skip-n-gram has n items, and allows skipping of k or fewer words (including 0-skip, i.e. straight n-grams).
-->

<!--
A closer look at skip-gram modelling
-->

<!--
===Flexgrams===

A variation without predetermined length, and allowing gaps.
-->

==Continuous bag of words==
<!--
A word is used along with the adjacent ones within a given window size.

You could consider them unordered{{verify}} n-grams, though CBOW seems most commonly used in neural-net approaches.
-->

==Sparse data smoothing==
{{Attempt}}


===The problem===
{{stub}}

{{comment|(For relative brevity, the below omits details on training and validation, including terms like held-out, deleted interpolation. While important to execution, it's something you can look up when implementing)}}


Data sparsity is problematic for various statistical data models, such as simple [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood] models, such as [[bag-of-words]], [[n-gram]], [[HMM]], and others.

For example, in the set of all possible word trigrams (for [[language modeling]] or such), most have not happened in many years of text, and probably never will.

In general, when you base probability (or such) fairly directly on appearance count, you can get a few problems:
* components that occur very few times - relative probabilities make less sense on the rare end of the scale. Consider two items: one that appears once, one that appears three times. How fair is it to say that one is 300% as probable as the other, particularly when comparing two things that almost never happen?

* most of your trained data is spent on things you really don't care about

* items that were ''not'' in your training data / trained model. It's hard to say what you want in such a case;
** If you take a zero count to mean zero probability, you say it will never happen and it should drop out completely, which is a decision too harsh for most goals
** ''ignoring'' components of an item that are not in the training data may work well enough, except you can run into problems where the unseen item consists significantly of unknown components (or where the unknowns would in reality be the more telling ones, if looked at more specifically). You would get more variation in your estimations than you'd want.
** misspellings and different word order can sometimes make likelihood estimations take an ''unnecessary'' nosedive



When we're talking about likelihood, all ways of smoothing over these problems amount to changing the [http://en.wikipedia.org/wiki/Probability_mass_function probability mass] distribution somehow.

The most basic solutions are to '''fake the unknowns''' by pretending they exist but are simply rare.
Some may try to be cleverer and boost some more than others.

The probability mass shift is also the reason many of the names have '''discounting''' in them: you take some probability from the likelier items to give to rarer things.
Do it too eagerly and you overestimate the rare and underestimate common things.

In some ways, assuming very low probability for unseen items (using only items we have halfway decent information on) has roughly the same effect and flaws as ignoring them, e.g. the nosedive mentioned.
Smarter fallbacks of likelihood, even if more approximate, are often preferable.



'''For n-gram models'''

The larger the n in the n-gram, the larger the problem, because the more likely it is that you don't have something in your training data - you could say you need exponentially more data to get a decent estimation for larger ''n''.
For many types of data, n&ge;4 is already problematic.
More data is good, but at some point a little fallback and smoothing is easier.

While not always mentioned along with smoothing, very low counts in count data are also a little weird. For example, if some n-gram appeared once and another twice, is it really twice as probable? (Statistically, using very little information is problematic.)

Smarter estimation (than assuming rareness) in such models often falls back onto (n-1)-gram (one-piece-shorter) items.

===Various solutions===
{{stub}}
<!--
In order of graspability?
-->


====Additive smoothing (add-one, add-&#948;)====

The simplest workaround to the zero-multiplication/zero-probability problem is '''add-one smoothing''', a.k.a. '''Laplace smoothing''', which adds one to every count (generally: to known_vocabulary + unknowns). In other words, pretend you saw everything a little more often.

This is ''only'' really a fix to the zero-multiplication problem.
The way it shifts the probability mass is not very natural - it gives a disproportional boost to rare things, and a lot of probability mass is taken away from known items when there are a lot of unknowns.

It means underestimating commonly seen things somewhat, and overestimating unknowns (particularly relative to known but rare items).
The effect depends on how much you add, and how high the counts are.

(In some methods this doesn't matter much, because none of it changes the order of the items, just the actual values. In some cases it will completely break things, though; it's a bad idea in n-gram models where you mix n-grams of different lengths)


'''Add-&#948; smoothing''', a.k.a. '''Lidstone smoothing''', is a variation that adds a smaller number, something between 0 and 1 and possibly based on the trained data. It mainly just plays with the magnitude of the distribution shift.

If the value is 0.5 this is also known as Expected Likelihood Estimation (ELE) and the Jeffreys-Perks Law.
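
A minimal sketch of turning raw counts into add-&#948;-smoothed probabilities (&#948;=1 gives add-one/Laplace; the function and variable names are just for illustration):

 def smoothed_probs(counts, vocab, delta=1.0):
     # counts: dict of item -> observed count; vocab: all items we want a probability for
     total = sum(counts.get(item, 0) for item in vocab)
     denom = total + delta * len(vocab)
     return {item: (counts.get(item, 0) + delta) / denom for item in vocab}
 p = smoothed_probs({'the': 50, 'cat': 3}, vocab={'the', 'cat', 'sat'}, delta=1.0)
 # 'sat' was unseen but still gets a small nonzero probability; the probabilities sum to 1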


All of these amount to saying that unknowns do exist and are very unlikely (and have less effect on well-known items),
and handle all unknowns identically.
If you can estimate more cleverly than that you probably should - and in many underlying models that's not very hard.

Various (parts of) methods of cleverness are somewhat portable between specific models.


====Absolute discounting====


====Witten-Bell discounting====

Intuitively: unseen n-grams are just ones that are unseen until now. So give them an estimation (the probability of seeing a new n-gram like this?{{verify}}).

(Can be considered an instance of Jelinek-Mercer smoothing{{verify}})


====Good-Turing smoothing / Good-Turing discounting====

Redistributes the probability mass.
Assumes a binomial distribution (though leaves better-known things alone).

Its most basic definition is more of an idea, and not robust by itself, but a good base for some more advanced smoothers (involving backoff/interpolation).

See also:
* http://en.wikipedia.org/wiki/Good-Turing




====Church-Gale smoothing====

Good-Turing plus bucketing.




====Kneser-Ney Smoothing====

Probably the best available within the smoothing category, in that it uses low-count items cleverly and effectively.

Similar to Witten-Bell in that it considers context, and can be considered an extension to absolute discounting.

Works better with interpolation than with backoff.


'''Modified Kneser-Ney Smoothing''' (introduced in Chen & Goodman) is a variation that seems to fairly consistently perform somewhat better.

===Combined estimators===

These aren't really just smoothing anymore, and include the many variations of interpolation and backoff - which use information on shorter items to estimate longer items with unseen parts.
Intuitively: that data is necessarily less sparse, and certainly gives a better estimate than additive smoothing.


For example, if both "john likes" and "john spork" are unknown, then you can estimate that 'john likes' is more likely by using the unigrams 'likes' and 'spork'. If the trigram ''john likes mnemonics'' is unseen, you may find you still have the bigram ''john likes''.
While not actually what you asked, it's still a much better estimation than additive smoothing gives you, particularly since that would happen a lot for larger ''n''.



'''Backoff''': if there is no information on a given n-gram, use (n-1)-grams within it.

'''Interpolation''': the probability of ''every'' n-gram is estimated using (n-1)-grams - which is a recursive definition. It's also a relatively general idea that can be and is tweaked in many ways.

These are sensible ways of using larger n-grams without necessarily causing a huge shift in probability mass.

It's not very hard to alter a backoff algorithm into an interpolating one, and vice versa, though most algorithms are primarily seen as just one.



'''(Simple) Linear interpolation''' is a weighed combination of basic maximum likelihood and use of lower-order n-grams (a recursive definition).


We want to weigh each n-gram based on training data - and we want to train that from a data set distinct from that used for ML training.
The common (named) choices:


Even with deleted interpolation, you won't easily get enough data.


'''Jelinek-Mercer smoothing''' - an interpolating method that tries to get more coverage using further fuzziness, namely bucketing based on counts, which can help even fairly small training sets perform decently.

There are variations that use different bucketing criteria.

Apparently not used much anymore; there are simpler options that tend to work as well or better (see below).



Katz smoothing, a.k.a. Katz backoff (n-gram based) -
extends Good-Turing by redistributing counts using different-order models (so it can be said to be backoff).
Often introduced for bigrams and unigrams, but can be taken recursively.

Reverts to a shorter n-gram context if the count for the n-gram is lower than a given threshold.




Things that need to reserve data (Jelinek-Mercer, Witten-Bell) are not used so much anymore, because there are methods that don't require it (e.g. Kneser-Ney), which tend to work better.


See also:
* http://en.wikipedia.org/wiki/Additive_smoothing
* http://en.wikipedia.org/wiki/Pseudocount
* http://en.wikipedia.org/wiki/Katz%27s_back-off_model
* http://en.wikipedia.org/wiki/Good-Turing


===Unsorted===

====Class-based smoothing====
<!--
* (clustering)
-->

====Absolute discounting====

====Held-out estimation====


===See also===
* Chen & Goodman, ''An Empirical Study of Smoothing Techniques for Language Modeling'' [http://www.google.com/search?q=An%20Empirical%20Study%20of%20Smoothing%20Techniques%20for%20Language%20Modeling]
** ...its references


<!--
To look at:

http://www.scribd.com/doc/50891038/34/Witten-Bell-smoothing

http://www.ee.columbia.edu/~stanchen/e6884/labs/lab3/x207.html

http://www.isip.piconepress.com/publications/courses/msstate/ece_8463/lectures/current/lecture_33/lecture_33_04.html

http://www.speech.sri.com/pipermail/srilm-user/2004q4/000245.html
-->


[[Category:Algorithms]]
[[Category:Math on data]]
[[Category:Machine Learning]]
  
 
=Unsorted=

This is more for my own overview than for teaching or exercise.




This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



Concepts & glossary

The curse of dimensionality is, roughly, the idea that when you add a dimension to your model you need proportionally more data for decent training of that model. Similarly, since the volume increases so fast, you probably have a sparsity problem. It's a fairly exponential problem, so even a few extra dimensions can make a model much harder to train well.
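
A quick worked example of that volume growth: if you want roughly 10 sample points along each axis, then 1 dimension needs about 10 points, 2 dimensions about 10^2 = 100, and 6 dimensions already about 10^6 = 1,000,000 - keeping the same sampling density quickly becomes infeasible.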



Stochastic processes, deterministic processes, random fields

A deterministic process deals only with determined cases: no unknowns or random variables.


A stochastic process (a.k.a. random process) allows indeterminacy, typically by working with probability distributions.

A lot of data is stochastically modeled, because you only need partial data and can generally only get partial data.


(Models mixing deterministic and stochastic processes are often called hybrid models)


A random field basically describes the generalization that happens when the index parameter is not necessarily time, or one-dimensional, or real-valued.


See also:


Types of problems

Tasks are often one of:

  • Clustering points out regions or groups of (mutual) similarity, and dissimilarity from other groups.
clustering may not deal well with future data of the same sort, unless some care has been taken, so may not be the best choice for a learning/predicting system
  • Vector quantization: Discretely dividing continuous space into various areas/shapes
which itself can be used for decision problems, labeling, etc.
  • Dimensionality reduction: projecting attributes into lower-dimensional data
where the resulting data is (hopefully) comparably predictive/correlative (compared to the original)
The reason is often to eliminate attributes/data that may be irrelevant or too informationally sparse
see also Ordination, Factor Analysis, Multivariate analysis (below)
  • Feature extraction: discovering (a few nice) attributes from (many) old ones, or just from data in general.
Often has a good amount of overlap with dimensionality reduction
  • others...

Markov property

the Markov property is essentially that there is no memory, only direct response: that response of a process is determined entirely by its current state (and input, if you don't already define that as part of the state).

More formally, "The environment's response (s,r) at time t+1 depends only on the Markov state s and action a at time t" [1]


There are many general concepts that you can make stateless, and thereby Markovian:

  • A Markov chain refers to a Markov process with finite, countable states [3]
  • A Markov random field [4]
  • A Markov logic network [5]
  • A Markov Decision Process (MDP) is a decision process that satisfies the Markov property
  • ..etc.
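
As a small illustration of 'the response depends only on the current state', a sketch of a Markov chain as a transition matrix (Python; the states and probabilities are made up):

 import random
 # transition probabilities: state -> {next_state: probability}
 transitions = {
     'sunny': {'sunny': 0.8, 'rainy': 0.2},
     'rainy': {'sunny': 0.4, 'rainy': 0.6},
 }
 def step(state):
     # the next state depends only on the current state, not on any earlier history
     r, cumulative = random.random(), 0.0
     for nxt, p in transitions[state].items():
         cumulative += p
         if r < cumulative:
             return nxt
     return nxt
 state = 'sunny'
 for _ in range(5):
     state = step(state)
     print(state)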


Observability and controllability

Underfitting and overfitting (learners)

Underfitting is when a model is too simple to be good at describing all the patterns in the data.

Underfitted models and learners may still generalize very well, and that can be intentional, e.g. to describe just the most major patterns.

It may be hard to quantify how crude is too crude, though.


Overfitting often means the model is allowed to be so complex that a part of it describes all the patterns there are, meaning the rest ends up describing just noise, or insignificant variance or random errors in the training set.

A little overfitting is not disruptive, but a lot of it often is, distorting or drowning out the parts that are actually modeling the major relationships.

Put another way, overfitting is the (mistaken) assumption that convergence on the training data means convergence on all data.


There are a few useful tests to evaluate overfitting and underfitting.


supervised versus unsupervised systems (learners)

Supervised usually means the training process is suggested or somehow (dis)approved. Usually it refers to having annotated training data, sometimes to altering it. Examples: classification of documents, basic neural network back-propagation, least-squares fitting, operator cloning

Unsupervised refers to processes that work without intervention. For example, self-organizing maps, or clustering documents based on similarity needs no annotation.

Semi/lightly supervised usually means there is an iterative process which needs only minimal human intervention, be it to deal with a few unclear cases, for information that may be useful, or such.


Inductive versus deductive systems (learners)

Inductive refers to training purely from data.

Deductive refers to also having a theory about the domain.


Inductive learning can be approached as a search in a hypothesis space.

Model-based versus model-free systems (learners)

Statistical modeling, data fusion (and related estimation)

Data fusion means merging data from various sources, e.g. various sensors. This often implies some sort of modelling.


Regression analysis

Regression glossary:

  • linear regression models a linear predictor function
typically a least squares estimator
  • simple regression indicates a single independent/explanatory variable

Linear regression

Logistic regression

Mean Squared Error (metric)

Bayes estimator

Kalman filter

Data clustering

Clustering groups a few related sub-problems, including

cluster formation - organizing into clusters
cluster segmentation - dealing with boundaries (often using cluster centers)
labeling - assigning meaningful names
for the (relatively few) cases where this makes sense to estimate
deciding how many groups to have in the result
evaluation of a solution (possibly feeding back into the previous point)


Formally, the simplest clustering can be described as:

  • you have a set of n data objects, call it D = { d1, ..., dn }
  • in its simplest shape, a clustering result is a disjoint partitioning of D
  • which makes clustering itself a function, mapping each datapoint to a cluster number/label that indicates membership of said cluster


The input data is often either

  • a set of a points in a many-dimensional space, usually a vector space, plus a metric to calculate distance between them, OR
  • a set of already-chosen distances
preferably complete, but depending on how it's made it might be sparse, and it may be easier to do some fuzzy statistical estimation than to ask people to complete it


The latter may well be in the form of a distance matrix / similarity matrix.

A bunch of methods that are given datapoints-plus-metric convert to that internally, but starting from data-plus-metric is often a little more flexible up front - both for data massaging, and sometimes for implementation reasons.


That said, the choice of metric takes care, because there are many ways to accidentally put some bias into the metric.


Many methods look at element-to-element similarities/dissimilarities, while a few choose to be more involved with the data those values come from (e.g. some maximum-likelihood-based methods common in bioinformatics)


Variations

Hard, soft, and fuzzy clustering

Hard clustering means each item should be assigned to a single group. This is essentially a partitioning.


Soft clustering means something can belong to more than one group.

Regularly used for data known to be too complex to be reduced cleanly with hard clustering, such as when there are closeby, overlapping, or ambiguous groups.

Soft clustering is generally understood as boolean soft clustering: something can belong to one or more clusters, but there are no degrees.


Fuzzy clustering is soft clustering plus degrees of membership.

This means intermediate results are effectively still moderately high-dimensional data; you often still have to make a decision about exclusion, thresholds, or such (preferably within the algorithm, to have all information available).

If you don't make such a decision, the result more resembles dimensionality reduction.

Agglomerative versus divisive

Agglomerative clustering usually starts with each item in its own cluster and merges them where it seems a good idea.

Divisive clustering usually starts with every item in a single cluster and iteratively splits them as it sees fit.

The difference seems to lie largely in what side they err on in unclear cases.(verify)


Hierarchical clustering

Hierarchical clustering creates a tree of relations, often by a process where we keep track of how things join, rather than just assimilating things into a larger blob.


Hierarchical clusterers can be flexible, in that their results can partition into an arbitrary number of groups (by choosing the depth at which there are that amount of groups).

Depending on the data, the results may also be useful as an approximation of fuzzy clustering. They may also be a little more helpful in cluster stability tests.


Some algorithms record and retain the distances of (/stress at) each such join. These can be interesting to visualize (think dendrograms and such), and to effectively allow the amount-of-clusters choice to be made later (think threshold in a dendrogram).


http://gepas.bioinfo.cipf.es/cgi-bin/docs/clusterhelp


Notes on....

Group number choice

Some algorithms try to decide on a suitable number of target groups, but many require you to choose an exact number before they get started.

This number is difficult to decide since there usually is no well-defined, implicit, calculable best choice.


This is a problem particularly in hard clustering, because any decision of group membership is very final. The membership of bordercases may not be stable under even the slightest amount of (sample) noise.


Things you can do include:

  • use evaluation to measure the fitness of a solution (or sub-solutions while still clustering), based on some fitness metric

Note that such a metric in itself is only a relative value in a distribution you don't know - you'll often have to calculate the fitness for many solutions to get a still-vague idea of fitness.

  • In the case of hard clustering, you can intentionally add some noise and see how much the membership of each item varies - and, say, report that as the confidence we have in a choice.
  • use some type of cross-validation


Inter-cluster and intra-cluster comparisons; susceptibilities

Depending on the algorithm, you often want to be able to compare

  • items to clusters (agglomerative and divisive decisions)
  • clusters to clusters (e.g. in hierarchical decisions)
  • items to items (e.g. for centroid/medoid decisions)


Cluster-to-cluster distances are most interesting to hierarchical clustering, and can be calculated in a number of ways (usually hardwired into the algorithm), including the following (a few of which are sketched in code after this list):

  • Single-link, a.k.a. minimum linkage or nearest linkage
the most similar combination (lowest distance) of possible comparisons
more susceptible to over-chaining than most other methods
  • Complete-link , a.k.a. max/farthest linkage
uses the least similar (largest distance) combination of possible comparisons
often gives non-chained, more equally divided clusters than single-link
outliers may have disproportional influence
  • Average-link (a.k.a. group-average, a.k.a. Group-average (agglomerative) clustering (GAAC)) - average of distances between all inter-cluster pairs
less sensitive to outliers than complete-link, less sensitive to inversion than centroid approaches.
  • Centroid approaches use a calculated average for comparison to a cluster
quite susceptible to inversions (verify)
  • Medoid approaches try to use a representative item for comparison to a cluster
somewhat susceptible to inversions (verify)
  • Density-based methods care more about local density of items and less directly about the exact distances involved
  • Ward's method
based on Ward criterion, a.k.a. the Ward minimum variance criterion
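
A minimal sketch of the first three of those linkage criteria (plain Python; dist stands in for whatever point-to-point metric is in use):

 from itertools import product
 def single_link(a, b, dist):
     # nearest pair across the two clusters
     return min(dist(x, y) for x, y in product(a, b))
 def complete_link(a, b, dist):
     # farthest pair across the two clusters
     return max(dist(x, y) for x, y in product(a, b))
 def average_link(a, b, dist):
     # mean over all inter-cluster pairs
     return sum(dist(x, y) for x, y in product(a, b)) / (len(a) * len(b))
 # example with 1D points and absolute difference as the metric
 d = lambda x, y: abs(x - y)
 print(single_link([1, 2], [5, 9], d), complete_link([1, 2], [5, 9], d), average_link([1, 2], [5, 9], d))
 # 3 8 5.5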



Potential problems:

  • outliers
    • including a single outlier may drastically change comparisons to that group.
    • The timing of their inclusion can have significant effects on the result.
  • the chaining effect refers to algorithms doing a chain/string of assignments to a group. The concept is clearest in single-link clustering, which is the most susceptible to it.
  • Inversions (sometimes 'reversal' or 'non-monotonicity') - describes when similarity values do not decrease monotonously in a series of iterations
    • easily happens when a process makes decisions based on centers that move in the process of clustering (such as in many centroid-style processes) particularly when combined with cases where there is no clear clustering solution.


See also:

TODO: read

on convergence
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Convergence is a nontrivial check in many algorithms.

You could check whether the group assignments have not changed, but this is sensitive to oscillations, resulting in a premature report of convergence and/or a failure to converge (depending somewhat on logic and data size).

A simple threshold is arbitrary since the error values often depend on the scale of the data values (which is not very trivial to correct for(verify)).


This is occasionally solved by error minimization criteria, for example minimization-of-the-sum-of-squares.

There are some details to this. For example, when reallocating a point between clusters, various methods consider only the error decrease in the target cluster - while the solution's total error may increase. It usually still converges, but the total error decreases with a little more oscillation, which a "no significant improvement in the last step" terminating criterion may be sensitive to (though arguably it's always more robust to check whether the error decrease is roughly asymptotic with the minimum you presume it'll get to).

The idea resembles Expectation Maximization (EM) methods in that it tries to maximize the probability of the clusters being correct by minimizing the energy/error.


Purely random initial positioning may cause the local minimum problem. Smartly seeding the initial centroids helps and need not be too computationally expensive - and in fact helps convergence; see e.g. k-means++.

Alternatively, you could run many versions of the analysis, each with random initial placement, and see whether (and/or to which degree) the results are stable, but this can be computationally expensive.

Robustness in hard clustering

Particularly hard clusterers are often not robust against even minor variations in the data. That is, separate data sets that are highly correlative may lead to significantly different results; areas in which membership is borderline flip-flop under the tiniest (sample) noise.


You can evaluate a solution for stress (or correlate the distances it implies to those in the original data, e.g. in hierarchical clustering), though in itself this is only a relative figure. It is mainly meaningful against the same measure from other clusterings of the same data, meaning that you *can* roughly compare different solutions for how well they express the original data, but only in a roughly converging way.


One trick is to cause the problem and test how varied results will be over mild variations over the data. You can for example repeat the clustering some amount of times with some noise, and record how often things change membership.

You can repeat the clustering omitting random pieces of data to lessen the effect of outliers - pieces of data that do not agree with the rest.

You can even aggregate the results from these runs and combine them into a sort of fuzzy cluster result that can show you instabilities, and/or converge on a clusters amount choice.

Evaluation
Silhouette coefficients
Davies-Bouldin index
Gap statistic
Unsorted

Implementation notes

k-means

The k-means problem is finding a cluster labelling for a given amount of clusters (k) with minimal error, where the error function is based on the within-group sum of squares.

(For completeness, that means for all elements in a group, calculate the square of the euclidean distance to the centroid, and sum up all these squares, which gives per-cluster error values. Various convergence checks will want to know the sum of these errors)


Most implementations are iterative and look something like:

  • Position k cluster centroids (at random, or sometimes slightly more cleverly)
  • For each element, assign to the nearest centroid
  • Recalculate (affected) centroid means (and often the error/energy at the same time)
  • Check whether the moved centroids change the element assignments.
If so, iterate
If no change, we have converged and can stop
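
A minimal sketch of that loop (Python with numpy; random initialization, a fixed iteration cap, and no k-means++ seeding or restarts - written for readability rather than speed):

 import numpy as np
 def kmeans(points, k, iterations=100, seed=0):
     # points: (n, d) float array; returns (assignments, centroids)
     rng = np.random.default_rng(seed)
     centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
     assignments = np.full(len(points), -1)
     for _ in range(iterations):
         # assign each point to the nearest centroid
         distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
         new_assignments = distances.argmin(axis=1)
         if np.array_equal(new_assignments, assignments):
             break                                  # converged: no assignment changed
         assignments = new_assignments
         # recalculate each centroid as the mean of its members (skip empty clusters)
         for j in range(k):
             members = points[assignments == j]
             if len(members):
                 centroids[j] = members.mean(axis=0)
     return assignments, centroids
 # toy usage: two obvious groups in 2D
 pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
 print(kmeans(pts, 2))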


Of iterative clustering methods, k-means is the simplest and many others can be said to be based on it.


K-means gives better results if the value of k is a good choice, representative of the data (it's not unusual to try various k and test them).

The common 'nearest cluster' criterion will avoid attraction of multiple clusters, and the whole will converge to a decent solution for k groups.


Limitations:

  • results are sensitive to initial placement, and it is easy to get stuck in a local minimum.
It is not unusual to run the clustering various times with different starting clusters and see how stable the clustering is.
  • If k is not representative of the structure in the data the solution may not be satisfying at all. This is partly caused by, and partly independent of, the fact that there may be various possible stable clusterings.
  • the simple distance metric means we say the shape for inclusion is always a circle


Average-case runtime is decent because it's a fairly simple algorithm. Worst-case runtime, for fairly pathological datasets, is quite high. There are faster approximations of k-means that you may wish to consider.

You can tweak k-means in various ways. For example, you can assign weights (based on frequency, importance, etc.) to elements to affect the centroid mean recalculation.


See also:


Variations:

  • ISODATA (Iterative Self-Organising Data Analysis Technique Algorithm) builds on k-means and tries to be smart in initial positions and in variations of k(verify).
  • H-means is a variation on k-means that recalculates the centroid only after a complete iteration over all the items, not after each reassignment.
Seen one way, it checks for error decrease less often, which makes it a smidge more sensitive to local minima and perhaps doesn't converge as nicely.
Although the difference in practice tends to be slight, k-means tends to be the slightly safer (and much more common) choice, even if the order it handles elements in is a different kind of bias that h-means avoids.
Canopy clustering

https://en.wikipedia.org/wiki/Canopy_clustering_algorithm

Bisecting k-means
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


hard c-means
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


UPGMA, WPGMA
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

UPGMA = Unweighed Pair Group Method using Arithmetic averaging

WPGMA = Weighed Pair Group Method using Arithmetic averaging

(These specific names/abbreviations come from Sneath and Sokal 1973)


UPGMA assigns equal weight to all distances:

D((u,v),w) = (nu*D(u,w) + nv*D(v,w)) / (nu+nv)

WPGMA uses:

D((u,v),w) = (D(u,w) + D(v,w)) / 2


In the weighed variant, the two clusters being combined weigh in equally regardless of their size; in the unweighed variant, every leaf weighs in equally, so a larger sub-cluster contributes proportionally more (which is what the size factors in the formula above do).


Both are bottom-up combiners working from a distance matrix, combining whichever leaf/cluster pair's distance is minimal, then recalculating the distance matrix.
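
A minimal sketch of one UPGMA merge step under these definitions (plain Python; the data structures are just for illustration, and WPGMA would replace the size-weighted average with a plain average of the two distances):

 def upgma_merge(D, sizes, u, v):
     # D: dict of frozenset({a, b}) -> distance; sizes: dict of cluster -> leaf count
     # merges clusters u and v, returning the updated distances and sizes
     new = (u, v)                       # the merged cluster, represented as a tuple
     sizes[new] = sizes[u] + sizes[v]
     for w in list(sizes):
         if w in (u, v, new):
             continue
         D[frozenset((new, w))] = (sizes[u] * D[frozenset((u, w))] +
                                   sizes[v] * D[frozenset((v, w))]) / (sizes[u] + sizes[v])
     del sizes[u], sizes[v]
     return D, sizes
 # toy usage with three leaves
 D = {frozenset(p): d for p, d in [(('A', 'B'), 2.0), (('A', 'C'), 6.0), (('B', 'C'), 6.0)]}
 sizes = {'A': 1, 'B': 1, 'C': 1}
 D, sizes = upgma_merge(D, sizes, 'A', 'B')
 print(D[frozenset((('A', 'B'), 'C'))])   # 6.0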


It's not too hard to argue that this terminology is a little arbitrary.

UPGMC, WPGMC
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Unweighed Pair Group Method using Centroids, and Weighed Pair Group Method using Centroids

(the specific names/abbreviations come from Sneath and Sokal 1973)

Fuzzy c-means (FCM)
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
  • Type: fuzzy clustering (not really partitioning anymore)

Method: Like k-means, but weighs centroid recentering calculations by fuzzy distance to all data points

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/cmeans.html

http://www.scholarpedia.org/article/Fuzzy_C-Means_Cluster_Analysis

Fuzzy k-means
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

http://www.know-center.tugraz.at/forschung/wissenserschliessung/downloads_demos/fuzzy_k_means_and_k_means_clustering_demo

Shell clustering
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Basic fuzzy c-means includes points by a radius - a sphere.

There are various other options, including:

  • fuzzy c-quadric shells algorithm (FCQS) detects ellipsoids
  • fuzzy c-varieties algorithm (FCV) detects infinite lines (linear manifolds) in 2D
  • adaptive fuzzy c-varieties algorithm (AFC): detects line segments in 2D data
  • fuzzy c-shells algorithm (FCS) detects circles
  • fuzzy c-spherical shells algorithm (FCSS) detects circles
  • fuzzy c-rings algorithm (FCR) detects circles
  • fuzzy c-rectangular shells algorithm (FCRS) detects rectangles
  • Gath-Geva algorithm (GG) detects ellipsoids
  • Gustafson-Kessel algorithm (GK) detects ellipsoids of roughly the same size
PAM

Partitional, medoid-based

CLARA

Partitional, medoid-based


AGNES

Hierarchical

AGglomerative NESting

DIANA

Hierarchical

DIvisive ANAlysis

Buckshot

Hybrid

Birch

Hybrid

BIRCH (balanced iterative reducing and clustering using hierarchies)

https://en.wikipedia.org/wiki/BIRCH

Cure

Hybrid

Clustering Using REpresentatives

https://en.wikipedia.org/wiki/CURE_algorithm

Rock

Hybrid


"Rock: A robust clustering algorithm for categorical attributes"

Chameleon

Hybrid, Hierarchical


"CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling"

DBSCAN

Density-based.

The assumption that real objects will always be a dense cloud of points more easily rejects random points as noise/outliers (even if relatively close).

It can also deal decently with closeby nonlinear clusters, if separated cleanly.


http://en.wikipedia.org/wiki/DBSCAN

OPTICS

Similar to DBSCAN

https://en.wikipedia.org/wiki/OPTICS_algorithm

FLAME

Fuzzy, density-based

http://en.wikipedia.org/wiki/FLAME_Clustering


Clustering by Committee (CBC)

Hybrid


Based on responsive elements (a committee) voting on specific outcomes.


See also:


Principal Direction Divisive Partitioning (PDDP)
Information Bottleneck

See also:


Agglomerative Information Bottleneck

See also:

Expectation Maximization Clustering

Related fields

See also

Dimensionality reduction

As a wide concept

Dimensionality reduction can be seen in a very wide sense, of creating a simpler variant of the data that focuses on the more interesting and removes the less interesting.


Note that in such a wide sense a lot of learning is useful as dimensionality reduction, just because the output is a useful and smaller thing, be it clustering, neural networks, whatever.

But in most cases it's also specific output, minimal output for that purpose.


Dimensionality reduction in practice is a much more mellow version of that, reducing data to a more manageable form. It historically often referred to things like factor analysis and multivariate analysis, i.e. separating out what seem to be structural patterns, but not doing much with them yet, often acknowledging that we usually still have entangled surface effects in the observations given, and our direct output is probably still a poor view of the actual dependent variables that created the observations we look at. (Whether that's an issue or not depends largely on what it's being used for)


Reducing the amount of dimensions in highly dimensional data is often done to alleviate the curse of dimensionality.[7]

Said curse comes mainly from the fact that for every dimension you add, the implied volume increases quicker and quicker.

So anything that wants to be exhaustive is now in trouble. Statistics has an exponentially harder job of proving significance (at least without exponentially more data), machine learning needs that much more data to train well, optimization needs to consider all combinations of dimensions as variables, etc.


Distance metrics are funnier, in that the influence of any one dimension becomes smaller, so differences in those metrics become smaller - and more error-prone in things like clustering. (verify)

It's not all bad, but certainly a thing to consider.


Ordination, Factor Analysis, Multivariate analysis

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(See also Statistics#Multivariate_statistics)


Ordination widely means (re)ordering objects so that similar objects are near each other, and dissimilar objects are further away.


Ordination methods are often a step in something else, e.g. good data massage before clustering.

It can become more relevant to higher-dimensional data (lots of variables), in that ordination is something you watch (so sort of an implicit side effect) in the process of dimensionality reduction methods.

Of course, different methods have their own goals and focus, different requirements, and their own conventions along the way.

Typical methods include PCA, SVD, and the variants discussed below (CA, MDS, and such).




Factor Analysis, Principal Component Analysis (PCA), and variants

Correspondence Analysis (CA)

Conceptually similar to PCA, but uses a Chi-square distance, to be more applicable to nominal data (where PCA applies to continuous data).


See also:

Multi-dimensionsional scaling (MDS)

Refers to a group of similar analyses, with varying properties and methods, but which all focus on ordination

Commonly mentioned types of MDS include:

  • Classical multidimensional scaling, a.k.a. Torgerson Scaling, Torgerson–Gower scaling.
  • Metric multidimensional scaling
  • Non-metric multidimensional scaling


It is regularly used to provide a proximity visualization, so the target dimensions may be two or three simply because this is easier to plot.


Depending on how you count, there are somewhere between three and a dozen different MDS algorithms.

Some MDS methods closely resemble things like PCA, SVD, and others in how they change the data. Some more generic MDS variants are more on the descriptive side, so can be solved with PCA, SVD, and such.


As ordination methods, most try not to change the relative distances, but do change the coordinate system in the process.


Input

Input to many methods is a similarity matrix - a square symmetric matrix often based on a similarity metric. Note that some similar methods are based on dissimilarity instead.


At a very pragmatic level, you may get

  • a ready-made similarity matrix
  • items plus a method to compare them
  • a table of items versus features (such as coordinates, preferences), with a method to compare them
  • perceived similarities between each item (e.g in market research)

There is little difference, except in assumptions like whether the feature values are Euclidean, independent, orthogonal, and whatnot.


Result evaluation

MDS solutions tend to be fairly optimal, in that for the chosen number of target dimensions, the solution's distances will correlate with the original data's distances about as well as they can.

There are still a number of things that help or hinder accuracy, primarily:

  • the choice of input values
  • the choice of target dimensions (since too few lead to ill placement choices)
  • (to some degree) the type of MDS
  • methods that have cluster-like implementations may be sensitive to initial state


You probably want to see how good a solution is.

The simplest method is probably calculating the correlation coefficient between the input (dis)similarities and the resulting data's (dis)similarities, to show how well the MDS result fits the variation in the original. By rule of thumb, values below 0.6 mean a bad solution, and values above 0.8 or 0.9 are pretty good solutions (depending on the accuracy you want, but also varying with the type of MDS).

Other methods include Kruskal's Stress, split-data tests, data stability tests (i.e. eliminating one item and seeing whether the result is similar), and test-retest reliability [8]


Algorithms

Note that in general, correlation can be complete (e.g. for k points in k-1 dimensions, or when distances between any two items are consistent whichever way they are combined, such as distances in Euclidean space), but usually is not. The output's choice of axes is generally not particularly meaningful.


The most common example is principal coordinates analysis (PCO, PCoA), also known as Classical multidimensional scaling, Torgerson Scaling and Torgerson-Gower scaling, which is a single calculation step so does not require iteration or convergence testing.

See also (Torgerson, 1952), (Torgerson, 1958), (Gower, 1966), CMDSCALE in R

(Note: PCO (Principal Coordinate Analysis) should not be confused with PCA (Principal Component Analysis); it is not the same method, although apparently equivalent when the PCA kernel function is isotropic, e.g. when working on Euclidean coordinates/distances)(verify)


The first dimension in the result should capture the most variability (first principal coordinate), the second the second most (second principal coordinate), etc. The eigenvectors of the (double-centered, squared) distance matrix yield the principal coordinates, and the eigenvalues give the proportion of variance accounted for. As such, eigenvalue decomposition (or the more general singular value decomposition) can be used for this MDS method. The degree to which distances are violated can be estimated from how many small or negative eigenvalues there are: if there are none (up to the number of dimensions you keep), the analysis is probably reasonable. You can use the eigenvalues to calculate how much of the total variance is accounted for(verify) - and you can choose afterwards how many dimensions to use (which you can't do in non-metric MDS).
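A minimal sketch of the above in plain numpy, assuming a symmetric Euclidean-style distance matrix; it eigendecomposes the double-centered squared-distance matrix, and glosses over details like strongly negative eigenvalues.

 import numpy as np

 def classical_mds(D, k=2):
     # Classical MDS / PCoA: from a symmetric distance matrix D to k principal coordinates.
     n = D.shape[0]
     J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
     B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
     eigval, eigvec = np.linalg.eigh(B)           # eigenvalues in ascending order
     idx = np.argsort(eigval)[::-1][:k]           # take the k largest
     scale = np.sqrt(np.clip(eigval[idx], 0, None))
     coords = eigvec[:, idx] * scale              # principal coordinates
     explained = eigval[idx] / np.clip(eigval, 0, None).sum()   # rough 'variance accounted for'
     return coords, explained

 # usage: points in 3D, ordinated into 2D
 pts = np.random.normal(size=(6, 3))
 D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
 coords, explained = classical_mds(D, k=2)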




Metric (multidimensional) scaling: a class of MDS that assumes dissimilarities are distances (and thereby also that they are symmetric). The term may be used to mean PCO specifically, but often indicates a somewhat more general class in which the stress function is more adjustable. The optimization method used is often stress majorization (see also SMACOF [9] (Scaling by Majorizing A COmplicated Function)).
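For illustration, a stress-majorization run via scikit-learn's MDS implementation (SMACOF-based); the toy points, the precomputed-dissimilarity setup, and the parameter values are assumptions for the example, and metric=False would switch it to non-metric MDS as discussed below.

 import numpy as np
 from sklearn.manifold import MDS

 # toy dissimilarity matrix: pairwise Euclidean distances between a few 2D points
 pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
 D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

 mds = MDS(n_components=2, dissimilarity='precomputed', metric=True,
           n_init=4, random_state=0)
 coords = mds.fit_transform(D)        # ordination coordinates
 print(mds.stress_)                   # final value of the stress function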




On iterative MDS methods: they minimize stress and have no unique solution (so the starting position may matter).

  • stress-based MDS methods
  • may be little more than a non-parametric version of PCO(verify)



Non-metric multidimensional scaling can be a broad name, but generally find a (non-parametric) monotonic relationship between [the dissimilarities in the item-item matrix and the Euclidean distance between items] and [the location of each item in the low-dimensional space].

The optimization method is usually something like isotonic regression (which is due to monotonicity constraints). Methods regularly have both metric and non-metric parts, and non-metric scaling in the broad sense can describe quite varying methods (see e.g. Sammon's NLM).

Note that the monotonic detail means that the ranking of items ends up as more important than the (dis)similarities. This may be a more appropriate way of dealing with certain data, such as psychometric data, e.g. ratings of different items on an arbitrary scale.

Non-metric MDS may give somewhat lower-stress solutions than metric MDS in the same amount of dimensions.(verify)

Also, certain implementations may deal better with non-normality or varying error distributions (often by not making those assumptions).


Examples:

  • Kruskal's non-metric MDS(verify)
  • Shepard-Kruskal Scaling(verify) (and (verify) whether that isn't the same thing as the last)
  • Sammon non-linear mapping [10]


Variations of algorithms can be described as:

  • Replicated MDS: evaluates multiple matrices simultaneously
  • Three-way scaling
  • Multidimensional Unfolding
  • Restricted MDS




Multidimensional scaling (MDS) [11]

  • classical MDS (quite similar to PCA under some conditions, apparently when you use euclidean distance?)
    • (difference matrix -> n-dimensional coordinates (vectors))
  • Kruskal's non-metric MDS (R: isoMDS)
  • Sammon's non-linear mapping (R: sammon)


See also
  • WS Torgerson (1958) Theory and Methods of Scaling
  • JB Kruskal, and M Wish (1978) Multidimensional Scaling
  • I Borg and P Groenen (2005) Modern Multidimensional Scaling: theory and applications
  • TF Cox and MAA Cox (1994) Multidimensional Scaling


Generalized MDS (GMDS)

A generalization of metric MDS where the target domain is non-Euclidean.

See also:


Sammon’s (non-linear) mapping

Singular value decomposition (SVD)

See also:

Manifold learning

An approach to non-linear dimensionality reduction.

Can be thought of as an extension of PCA-style techniques that is sensitive to non-linear structure in data.


Expectation Maximisation (EM)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A broadly applicable idea/method that iteratively uses the Maximum Likelihood (ML) idea and its estimation (MLE): it alternates between estimating the expected values of unobserved/latent variables given the current parameters (the E step), and re-estimating the parameters to maximize the likelihood given those expectations (the M step).
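As a small worked example of that alternation: a two-component 1-D Gaussian mixture fit in plain numpy, with made-up data and a fixed iteration count - just one illustrative instance of the general idea, not a robust implementation.

 import numpy as np

 def em_two_gaussians(x, iters=50):
     # Fit a 2-component 1D Gaussian mixture with EM.
     mu = np.array([x.min(), x.max()], dtype=float)        # crude initialisation
     var = np.array([x.var(), x.var()]) + 1e-6
     pi = np.array([0.5, 0.5])
     for _ in range(iters):
         # E step: responsibility of each component for each point
         pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
         weighted = pi * pdf
         resp = weighted / weighted.sum(axis=1, keepdims=True)
         # M step: re-estimate parameters from the responsibilities
         nk = resp.sum(axis=0)
         mu = (resp * x[:, None]).sum(axis=0) / nk
         var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
         pi = nk / len(x)
     return pi, mu, var

 x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 200)])
 print(em_two_gaussians(x))    # means should end up near 0 and 5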


Fuzzy coding, decisions, learning

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


On bias

Methods / algorithms / searchers

Decision trees

ID3
Pruning (ID3, others)
Rule Post-pruning; C4.5

Instance-based learning

Bayesian learning

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Bayesian learning is a general probabilistic approach, used most specifically as a probabilistic classifier.

Mathematically it is based on any observable attribute you can think of, and the math requires Bayesian inversion (see below).

Many basic implementations also use the Naive Bayes assumption (see below), because it saves a lot of computation time, and seems to work almost as well in most cases.

Bayesian classifier

Bayes Optimal Classifier
Naive Bayes Classifier

Bayesian (Belief) Network

Some classifiers

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
  • Parzen classifier
  • Backpropagation classifier


Evaluating classifiers

Kernel(-related) methods, the Kernel Trick

Support Vector Machines

Markov Models, Hidden Markov Models

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Something like (the simplest possible) Bayesian Belief Networks, but geared to streams of data. Can be seen as a state machine noting the likeliness of each next step based on a number of preceding steps.

The hidden variant only shows its output (and hides the model that produces it), the non-hidden one shows all of its state.

Simple ones are first-order, i.e. the next step depends only on the current state.


Optimization theory, control theory

Glossary

Some controllers

In terms of the (near) future:

  • greedy control doesn't really look ahead.
  • PID can be tuned for some basic tendencies
  • MPC tries to minimize mistakes in predicted future


  • For example, take an HVAC system that actively heats but passively cools. This effectively means you should be very careful about overshooting. You would make the system sluggish -- which also reduces performance, because it lengthens the time of effects and settling.


Non-linear:

  • HVAC


Greedy controllers

Doesn't look ahead, just minimizes for the current step.


For example basic proportional adjustment.

Tends not to be stable.

Can be stable enough for certain cases, in particular very slow systems where slow control is fine, and accuracy not so important.


For example, water boilers have such a large volume that even a bang-bang controller (turn the heater element fully on or off according to a temperature threshold) will keep the water within a few degrees of that threshold, simply because the water's heat capacity is large in relation to the heating element you'd probably use.


But in a wider sense - e.g. that same boiler with a small volume, or a powerful heater - such control causes unproductive feedback, e.g. oscillations when actuation is about as fast as or faster than measurement.


Hysteresis (behaviour)

Hysteresis control (type)

Map-based controller (type)

PID controller

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

PID is a fairly generic control-loop system, still widely used in industrial control systems.

It is useful in systems that have to deal with delays between and/or in actuation and sensing, where it can typically be tuned to work better than greedy controllers (and also be tuned to work worse), because unlike greedy control, you can try to tune out overshoots as well as oscillations.

PID is computationally very cheap (a few adds and multiplies per step), compared to some other cleverer methods.


Yet:

  • There are no simple guarantees of optimality or stability
    • you have to tune them, and learn how to tune them
  • tuning is complex, in that it depends on
    • how fast the actuation works
    • how fast you sample
    • how fast the system changes/settles
  • doesn't deal well with long time delays
  • the derivative component is sensitive to noise, so filtering may be a good idea
  • has trouble controlling complex systems
    • more complex systems should probably look to MPC or similar
  • linear at heart (assumes measurement and actuation are relatively linear)
    • so doesn't perform so well in non-linear systems
  • symmetric at heart, so not necessarily well-suited to non-symmetric actuation
    • consider e.g. an HVAC system -- which would oscillate around its target by alternately heating and cooling
    • it is much more power efficient to do one passively, e.g. active heating and passive cooling (if it's cold outside), or active cooling and passive heating (if it's warmer outside)
    • this means it's easier to overshoot, and more likely to stick off-setpoint on the passive side, so on average be on one side
    • you could make the system sluggish -- in this case it reduces the speed at which it reaches the setpoint, but that is probably acceptable to you
    • in other words: a sluggish system and/or a bias to one side


Some definition
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The idea is to adjust the control based on some function of the error, and a Proportional–Integral–Derivative (PID) controller combines the three components it names, each tweaked with their own weight (gain).

The very short version is that

  • P adjusts according to the current (proportional) error
  • I adjusts according to the accumulated (integrated) error
  • D adjusts according to the rate of change (derivative) of the error


It can be summarized as:

output(t) = P·e(t) + I·∫ e(τ) dτ + D·(de(t)/dt)

where

  • e(t) is the error
  • P, I, and D are scalar weights controlling how much effect each component has
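A minimal discrete-time sketch of that formula; the gains, time step, and the toy first-order process below are made-up values purely for illustration, not tuning advice.

 def pid_step(error, state, P=2.0, I=0.5, D=0.2, dt=0.1):
     # One PID update; state carries (integrated error, previous error).
     integral, prev_error = state
     integral += error * dt
     derivative = (error - prev_error) / dt
     output = P * error + I * integral + D * derivative
     return output, (integral, error)

 # toy loop: drive a sluggish process toward setpoint 1.0
 setpoint, value, state = 1.0, 0.0, (0.0, 0.0)
 for _ in range(200):
     control, state = pid_step(setpoint - value, state)
     value += 0.05 * control      # stand-in for the real actuator + process response
 print(value)                     # should have settled near the setpoint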


Some intuition
So how do you tune it?

MPC (Model Predictive Control)

FLC (Fuzzy Logic Control)

Notes on

Gradient-based learning

See also

Log-Linear and Maximum Entropy

Reinforcement learning (RL)

See also:

  • LP Kaelbing, ML Littman, AW Moore. (1996) Reinforcement learning: A survey.
  • RS Sutton, AG Barto (1998) Reinforcement Learning: An Introduction


Classifiers

Combining classifiers

Connectionism

Connectionism is the idea that interesting things are emergent effects from interconnected networks of simple units.

The term is used in cognitive psychology, cognitive science, neuroscience, and philosophy of mind, to model mental and behavioral phenomena -- but most recognizably in artificial intelligence in the form of neural networks.

Another angle is computationalism, which works more on higher-level expression, and tends to think connectionism is no more than lower-level association. The two are not necessarily incompatible, but certainly distinct approaches.


Neural networks

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Terminology old and new

ANN history

Perceptrons

Feedforward nets

RNNs

CNNs

Deep?

Unsorted

Evolutionary computing

See also


Semi-sorted

Feature model notes

Bag of words/features (assumption)

The bag-of-words model (more broadly bag-of-features model) use single words/features as an (unordered) set.

The implied assumption are that the features are independent, so order and adjacency is unimportant.

From the n-gram perspective, unigram models amount to this.

You can also see it as a voting system.


In text processing

One common example is Naive Bayes spam filtering , because its naivety essentially is this assumption that feature order does not matter.

Though real-world naive bayes spam filtering would take more complex features than single words (and may re-introduce order via n-grams or such), simple example versions may use 1-grams.

Other types of classifiers also make this assumption, or make it easy.


In image recognition / image search

The bag-of-features refers to models that recognize parts of a whole, and used in a number image tasks, such as feature extraction, object/image recognition, image search, (more flexible) near-duplicate detection, and such.

The idea that you can describe an image by the collection of small things we recognize in it, and that combined presence is typically already a strong indicator (particularly when you add some hypothesis testing). Exact placement can be useful, but often easily secondary.

See also:

Naive Bayes

Some background

A certain Reverend Thomas Bayes said a number of things about probability, used mostly in Bayesian inference in statistics, and in Naive Bayes classification.


Classification, in general, is usually based on a trained model:

  • deciding on a set of distinct target classes (say, exclusives as 'spam' vs. 'not spam', or rough article subjects (preferably exclusive, or classification itself does not make that much sense, although the value-for-class assignments still may)
  • deciding some features to look at, that classification will be based on
  • learning the importance of features to each class (often using good examples for each class)

Given something unseen to classify, you extract its features the same way you did for the training set, and see which class compares best. Classification is regularly many fuzzy measures of fit followed by a hard choice of which seems best.


Naive Bayes refers to one of the simpler algorithms that does classification this way.

Its definition has a few notational conventions:

  • Each training class is referred to as c in the set of classes C (c1...ci)
  • We eventually want to judge an unseen document D
  • and what we want for D is the class for which the probability that the document fits that class is highest


You can express such Bayesian classification as:

the c for which P(c|D) is highest 

or, a bit more formally yet,

c = argmax( P(ci|D) )

Bayesian Inversion

The P(c|D) above is not obvious to calculate, because it immediately asks us for a best choice for a document, while a thorough answer will probably depend on knowledge of the document's features, all classes, and all documents.


So we need to break the problem down before we can give an estimation of the probability P(c|D).

The first step is to rewrite using the concept of Bayesian inversion, based on Bayes's rule (which pops up in various other types of probability calculations and estimations).

             P(c) * P(D|c) 
P(c|D)  =  -----------------
                 P(D)

...applied for each class c in C.


P(c) is the base probability of each class, that is, the probability that a document is in class c given no document information.

This is sometimes assumed to be equal for all classes (in which case it falls away),
and sometimes it is estimated, for example using the assumption that the number of documents for each class (in the training set) is a strong indication of how commonly one of its items is seen.

P(D) is the chance of finding the document, in general.

This is usually assumed to be independent of the classification problem.
Because of that, and because it is hard to realistically estimate for an unseen document, it is often assumed to be equal for all documents (so falls away).

P(D|c) is the interesting one. It asks for the probability of seeing a particular document given a particular class.

This is still abstract, and not yet a directly calculable answer, but it is a smaller problem than we had a few paragraphs up: unlike the earlier form (P(c|D)), it deals not with an immediate choice between alternatives, but only with a calculation/estimation of fit, of 'how well does a specific document fit this specific class?'.
This degree of fit for a document in a class can be done with any method that judges similarity, as long as it is (fairly) stable/consistent between documents, and it can be calculated (roughly) equally well for seen and unseen documents.


Long story somewhat shorter, Bayesian inversion reduces the classification problem from

"Choose best class immediately"

to

"find the class for which our estimation of fit is highest" (by making an estimator that can judge individual documents, and can do so stably / comparably well for all documents)

or from:

argmax( P(ci|D) )

to:

argmax( P(ci) * P(D|ci) )

or, assuming equal base-class probabilities, to:

argmax( P(D|ci) )

Features, Words, and Naivety

So let's talk about an implementation.

If we want to calculate a class's fit based on features, we need to choose those features, and a method to calculate an arbitrary document's fit to those features.

It's nice if each feature is fairly robust in that it does not react in an unnecessarily noisy way, and productive for the classification task at hand.


In Naive Bayes introductions, the choice of features is often words (some introductions talk mainly of words, some talk more broadly about features), and the probabilities are based on the relative occurrence of each chosen word.

Words are a specific implementation choice, but are a nice introduction, in part because it speaks to the imagination in cases such as spam (even though it's limited for this, given that spammers know and try to defeat this model).


If you're programming such a toy example, then the amount of words is often reduced to maybe a thousand. The actual selection is interesting, in that they ought to be distinguishing within the training set ("the" may be common but not useful), but not be so unusual that they are rare within unseen documents.

There are cleverer ways of choosing words, and you probably want to consider the Zipf distribution of words, but to keep the example simple, let's say we take the most common 1000 in the training set, minus some obvious stopwords.


Naive Bayes is often explained by saying we want a probability based on n features in feature set F: P(c|f1..fn), which Bayes-inverses and denominator-eliminates to:

P(c) * P(f1..fn|c)

...which roughly means "how do features in this new document compare to the one for the class?"


Technically, P(f1..fn|c) should expand to a very long formula, as each feature depends on others. That's where the naivety of Naive Bayes comes in: the Naive Bayes assumption is simply that all features are independent of all others. This is rarely true, but is much simpler to work with, is evaluated much faster, and works better than you might expect. (this also effectively makes it a bag-of-words model)

It means the calculation of the probability of the class based on the features simplifies to a simple multiplication of individual feature values / probabilities. Given there are n features:

P(c) * Π_{i=1..n} P(fi|c)


Note that the words-as-features setup is a bag-of-words/features assumption: it plus the naivety means that word order is ignored. This means that potentially indicative phrases, collocations, and such are represented only weakly at best, by their words' independent probabilities being slightly higher. Word n-grams are one possible fix, but will likely run you into a sparseness problem, since n-grams are implicitly rarer than single words. Even just bi-grams may be problematic - if you take all possible ones from a large set of documents, you'll find that most documents contain only a tiny fraction of the bigrams in real use.

Bayesian classification as a procedure...

Naive Bayes training comes down to:

  • Deciding which set of features (F) you will use.
  • Train, which calculates characteristics for each class, namely:
    • P(c) for each class (see above), and more importantly,
    • all P(f|c): the probability of each feature f in each class c


Later, when classifying an unseen document, you calculate P(c) * P(f1..fn|c) for it:

  • for each c in C:
    • take P(c), times
    • the calculated probability of each feature in class c, i.e. each P(f|c) based on the document.
  • Choose the one with the highest probability, and you've got a classifier.


Choosing your features, and further discussion

Naive Bayes with simple word features works fairly well on various types of text, since it essentially notices differences in vocabulary use, the theory being that that is indicative of the document class. Common words can be useful when they aren't uniformly common between all classes, uncommon ones since they tend to up the probability of just a few classes.


You could simply use all words in all documents and end up with tens of thousands of features, although it would help speed to prune that a little. Which ones to leave is a choice you can be halfway smart about - although note that too much tweaking just means it will work better on the training data, with no guarantees for unseen data.

Also note that when you pre-process the training data, you need to process the unseen documents the same way. When choosing the features, you should consider what this may do to unseen documents.


A feature probability of zero is evil to calculations, as it randomly destroys comparability in both the training and classification steps. It can happen fairly easily, such as when a word in the vocabulary doesn't occur in a document. A common solution is to cheat and pretend you did actually see it a little. There are a few different ways; some disturb the relative probabilities less than others.


You can add other features. The main property they should have is that their probabilities compare well, i.e. are on the same sort of scale. You could for example add important words a few times, a hackish way of weighing them more, or add select word bigrams or character n-grams.

You can alter the model further, observing that what you actually do is per-class strengthening or weakening of a probability that it will win. It doesn't even have to be a simple probability; you could for example include sentence length, which is not a direct probability, so you would have to e.g. remember the average sentence length in each class, calculate the average for the unseen document, and define some metric of similarity. This isn't Naive Bayes anymore, but as a probability-based classifier it may work a little better if you choose your alterations well.


Pseudocode

This example pseudocode uses only word features, and some basic add-one(verify) smoothing to avoid the zero problem.

Training (word features)

U = feature universe
D = document set (text,class pairs)
C = set of classes in D
foreach c in C:
  P(c) = |D with class c| / |D|
  Text = concatenation of documents in current class
  N    = |tokens in Text|
  foreach token t in U:
    P(t|c) = ( numoccurrences(t in Text) + 1 ) / ( N + |U| )


Classifying (word features)

foreach c in C:
  p=P(c)
  foreach token in Document
    p = p*P(token|c)

...then collect each of these probabilities and choose the class with the highest probability.


In reality, the probabilities quickly become very small, and you run into the problem that standard 32-bit or 64-bit floating point numbers cannot accurately contain them. A common workaround for this is to add the logs of the probabilities (yielding negative numbers) instead of multiplying the probabilities. These values are much less likely to scale out of control, and they are just as monotonic as the probabilities.
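A small runnable version of the above in Python (word features, add-one smoothing, log probabilities); the toy training documents and class names are obviously made up for illustration.

 import math
 from collections import Counter, defaultdict

 docs = [("cheap pills buy now", "spam"),
         ("meeting notes attached", "ham"),
         ("buy cheap meds now", "spam"),
         ("lunch meeting tomorrow", "ham")]

 vocab = {t for text, _ in docs for t in text.split()}
 by_class = defaultdict(list)
 for text, c in docs:
     by_class[c].append(text)

 logprior, loglik = {}, {}
 for c, texts in by_class.items():
     logprior[c] = math.log(len(texts) / len(docs))                    # P(c)
     counts = Counter(t for text in texts for t in text.split())
     total = sum(counts.values())
     loglik[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))  # P(t|c), add-one smoothed
                  for t in vocab}

 def classify(text):
     scores = {c: logprior[c] + sum(loglik[c][t] for t in text.split() if t in vocab)
               for c in logprior}
     return max(scores, key=scores.get)

 print(classify("cheap meds meeting"))   # 'spam' for this toy data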

N-gram notes

N-grams are contiguous sequences of length n.


They are most often seen in computational linguistics.


Applied to sequences of characters it can be useful e.g. in language identification, but the more common application is to words.

As n-gram models only include dependency information when those relations are expressed through direct proximity, they are poor language models, but useful for things working off probabilities of combinations of words, for example statistical parsing, collocation analysis, text classification, sliding-window methods (e.g. a sliding-window POS tagger), (statistical) machine translation, and more.


For example, for the already-tokenized input

  This is a sentence .

the 2-grams would be:

  (This, is)
  (is, a)
  (a, sentence)
  (sentence, .)
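Generating these from a token list is short; a sketch in plain Python (no particular library assumed):

 def ngrams(tokens, n):
     # All contiguous n-grams of a token sequence, as tuples.
     return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

 print(ngrams(["This", "is", "a", "sentence", "."], 2))
 # [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]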


Skip-grams

An extension of n-grams that allows a certain number of arbitrarily positioned omissions (skips) in the input.

Skip-grams come from speech analysis, processing phonemes.

In word-level analysis their purpose is a little different. You could say that we acknowledge the sparsity problem, and decide to get more out of the data we have (focusing on context) rather than trying to smooth.
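A sketch of one common definition (k-skip-n-grams: n tokens taken in order from a window of n+k positions, so at most k tokens are skipped); plain Python, and the function name is just for this example.

 from itertools import combinations

 def skipgrams(tokens, n, k):
     # n-grams that allow up to k skipped tokens within each window.
     grams = set()
     for i in range(len(tokens)):
         window = tokens[i:i + n + k]
         for rest in combinations(range(1, len(window)), n - 1):
             grams.add((window[0],) + tuple(window[p] for p in rest))
     return grams

 print(skipgrams(["This", "is", "a", "sentence", "."], n=2, k=1))
 # includes plain bigrams like ('This', 'is') plus skips like ('This', 'a')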



Continuous bag of words

Sparse data smoothing

This article is an attempt at a decent overview, introduction, summary, or such.

It may currently just contain a mess of information. If it is likely to keep sucking it will probably be removed later.


The problem

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(For relative brevity, the below omits details on training and validation, including terms like held-out, deleted interpolation. While important to execution, it's something you can look up when implementing)


Data sparsity is problematic for various statistical data models, such as simple maximum likelihood models, such as bag-of-words, n-gram, HMM, and others.


For example, in the set of all possible word trigrams (for language modeling or such), most have not happened in many years of text, and probably never will.

In general, when you base probability (or such) fairly directly on appearance counts, you can get a few problems:

  • components that occur very few times - relative probabilities make less sense on the rare end of the scale. Consider two items: one that appears once, one that appears three times. How fair is it to say that one is 300% as probable as the other, particularly when comparing two things that almost never happen?
  • most of your trained data is spent on things you really don't care about
  • items that were not in your training data / trained model. It's hard to say what you want in such a case;
    • If you take a zero count to mean zero probability, you say it will never happen and it should drop out completely, which is a decision too harsh for most goals
    • ignoring components of an item that are not in the training data may work well enough, except you can run into problems where the unseen item consists significantly of unknown components (or where the unknowns would in reality be the more telling ones, if looked at more specifically). You would get more variation in your estimations than you'd want.
    • misspellings and different word order can sometimes make likelihood estimations take an unnecessary nosedive


When we're talking about likelihood, all ways of smoothing over these problems amount to changing the probability mass distribution somehow.

The most basic solutions are to fake the unknowns by pretending they exist but are simply rare. Some may try to be cleverer and boost some more than others.

The probability mass shift is also the reason many of the names have discounting in them: you take some probability from the likelier items and give it to rarer things. Do it too eagerly and you overestimate the rare and underestimate the common things.

In some ways, assuming very low probability for unseen items (using only items we have halfway decent information on) has roughly the same effect and flaws as ignoring them, e.g. the nosedive mentioned. Smarter fallbacks of likelihood, even if more approximate, are often preferable.


For n-gram models

The larger the n in the n-gram, the larger the problem, because the more likely it is that you don't have something in your training data - you could say you need exponentially more data to get a decent estimation for larger n. For many types of data, n≥4 is already problematic. More data is good, but at some point a little fallback and smoothing is easier.

While not always mentioned along with smoothing, very low counts in count data are also a little weird. For example, if some n-gram appeared once and another twice, is the latter really twice as probable? (statistically, using very little information is problematic)

The smarter estimation (than assuming rareness) in such models often falls back onto the (n-1)-gram (one-piece-shorter) items.

Various solutions

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Additive smoothing (add-one, add-δ)

The simplest workaround to the zero-multiplication/zero-probability problem is add-one smoothing, a.k.a. Laplace smoothing, which adds one to every count (generally: to known_vocabulary + unknowns). In other words, pretend you saw everything a little more.

This is only really a fix to the zero-multiplication problem. The way it shifts the probability mass is not very natural - it gives a disproportionate boost to rare things, and a lot of probability mass is taken away from known items when there are a lot of unknowns.

It means underestimating commonly seen things somewhat, and overestimating unknowns (particularly relative to known but rare items). The effect depends on how much you add, and how high the counts are.

(In some methods this doesn't matter much, because none of it changes the order of the items, just the actual values. In some cases it will completely break things, though; it's a bad idea in n-gram models where you mix n-grams of different lengths.)


Add-δ smoothing, a.k.a. Lidstone smoothing, is a variation that adds a smaller number, something between 0 and 1 and possibly based on the trained data. It mainly just plays with the magnitude of the distribution shift.

If the value is 0.5 this is also known as Expected Likelihood Estimation (ELE) and the Jeffreys-Perks Law.
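A sketch of the estimate itself in plain Python (δ=1 gives add-one/Laplace, δ=0.5 the ELE variant); the toy counts and the vocabulary size are arbitrary assumed values.

 from collections import Counter

 def additive_prob(token, counts, vocab_size, delta=1.0):
     # Add-delta (Lidstone) estimate: unseen tokens get a small non-zero probability.
     total = sum(counts.values())
     return (counts.get(token, 0) + delta) / (total + delta * vocab_size)

 counts = Counter("the cat sat on the mat".split())
 V = 10000                                    # assumed vocabulary size
 print(additive_prob("the", counts, V))       # seen twice, still fairly small
 print(additive_prob("unicorn", counts, V))   # unseen, but no longer zero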


All of these amount to saying that unknowns do exist and are very unlikely (with less effect on well-known items), and they handle all unknowns identically. If you can estimate more cleverly than that you probably should - and in many underlying models that's not very hard.

Various (parts of) methods of cleverness are somewhat portable between specific models.


Absolute discounting

Witten-Bell discounting

Intuitively: Unseen n-grams are just ones that are unseen until now. So give it an estimation (probability of seeing a new n-gram like this?(verify)).

(Can be considered an instance of Jelinek-Mercer smoothing(verify))


Good-Turing smoothing / Good-Turing Discounting

Redistributes the probability mass, assuming a binomial distribution (though it leaves better-known things alone).

Its most basic definition is more of an idea, and not robust by itself, but a good base for some more advanced smoothers (involving backoff/interpolation)

See also:



Church-Gale smoothing

Good-Turing plus bucketing.





Kneser-Ney Smoothing

Probably the best available within the smoothing category, in that it uses low-count items cleverly and effectively.

Similar to Witten-Bell in that it considers context, and can be considered an extension to absolute discounting

Works better with interpolation than with backoff


Modified Kneser-Ney Smoothing (introduced in Chen & Goodman) is a variation that seems to fairly consistently perform somewhat better.

Combined estimators

These aren't really just smoothing anymore, and include the many variations of interpolation and backoff - which use information on shorter items to estimate longer items with unseen parts. Intuitively: that data is necessarily not as sparse, and certainly a better estimate than additive smoothing.


For example, if both "john likes" and "john spork" are unknown, then you can estimate that 'john likes' is more likely by using the unigram 'likes' and 'spork'. If the trigram john likes mnemonics is unseen, you may find you have the bigram bigram john likes. While not actually what you asked, it's still a much better estimation than additive smoothing gives you, particularly since that would happen a lot for larger n.


Backoff: if there is no information on a given n-gram, use (n-1)-grams within it.

Interpolation: the probability of every n-gram is estimated using its (n-1)-grams - which is a recursive definition. It's also a relatively general idea that can be and is tweaked in many ways.

These are sensible ways of using larger n-grams without necessarily causing a huge shift in probability mass.

It's not very hard to alter a backoff algorithm into an interpolating one, and vice versa, though most algorithms are primarily seen as just one.


(Simple) linear interpolation is a weighted combination of basic maximum likelihood and use of lower-order n-grams (a recursive definition).
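A sketch of that recursive idea for trigrams, echoing the 'john likes' example above; it uses maximum-likelihood counts plus fixed interpolation weights, whereas in practice the λ weights are trained on held-out data (as discussed next) and should sum to 1.

 from collections import Counter

 def interp_trigram_prob(w1, w2, w3, uni, bi, tri, total, lambdas=(0.1, 0.3, 0.6)):
     # P(w3 | w1, w2) as a weighted mix of unigram, bigram and trigram ML estimates.
     l1, l2, l3 = lambdas
     p_uni = uni[w3] / total if total else 0.0
     p_bi  = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
     p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
     return l1 * p_uni + l2 * p_bi + l3 * p_tri

 tokens = "john likes mnemonics and john likes sporks".split()
 uni = Counter(tokens)
 bi  = Counter(zip(tokens, tokens[1:]))
 tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
 print(interp_trigram_prob("john", "likes", "sporks", uni, bi, tri, len(tokens)))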


We want to weigh each n-gram based on training data - and we want to train that from a data set distinct from that used for ML training. The common (named) choices:


Even with deleted interpolation, you won't easily get enough data.


Jelinek-Mercer smoothing - an interpolating method that tries to get more coverage using further fuzziness, namely bucketing based on counts, which can help even fairly small training sets perform decently.

There are variations that use different bucketing criteria


Apparently not used much anymore; there are simpler options that tend to work at least as well.


Katz smoothing, a.k.a. Katz backoff (n-gram based) - extends Good-Turing's redistribution of counts, but using different-order models (so it can be said to be backoff). Often introduced for bigrams and unigrams, but can be applied recursively.

Reverts to using a shorter n-gram context if the count for the n-gram is lower than a given threshold.



Things that need to reserve data (Jelinek-Mercer, Witten-Bell, ...) are not used so much anymore, because there are methods that don't require it (e.g. Kneser-Ney), which tend to work better.



See also:


Unsorted

Class-based smoothing

Absolute discounting

Held-out estimation

See also

  • Chen & Goodman, An Empirical Study of Smoothing Techniques for Language Modeling [12]
    • ...its references

Unsorted