Classification, clustering, decisions, and fuzzy coding
This is more of an overview for my own use than something meant for teaching or exercises.
Decisions, fuzzy coding
Classic machine learning is concerned with approximating data into rules, usually as simply as possible, and as automatically as workable.
The point is usually to handle unseen data, as correctly as can be expected, according to the given task.
It is interested in:
- Maximizing or improving measurable performance
- At a specific task
- Given some sort of experience
There is of course overlap with data mining, natural language processing, character and other sorts of image recognition, and other areas that have to deal with changing or inexact environments.
Learning versus adaptation
You may notice that learners are similar to agents (and their design) as defined by more classical artificial intelligence.
This is not unintentional, but may suggest more autonomy than will often be achieved. AI often focuses on learners that are robust to (can maintain performance under) change in the environment, and even to changes in the controlled system (and possibly even to the learning process itself).
Such systems are more complex. One of the issues is that there is a feedback loop, which can have issues of its own, such as oscillation (note that feedback can have useful converging effects as well as annoying ones like oscillation -- depending on what, exactly?(verify)).
Adaptive controllers become necessary as interacting systems grow more complex. For example, the more sensors and actuators there are, the more error-prone design by parameter estimation becomes, and the more adaptive learning can simplify things -- for the initial task, but also for unseen ones.
On bias
Bias: Experience, Defining a Hypothesis Space, and the Target Function
One thing to realize is that many different decisions can bias your results, in that they can cut off types of answers, skew away from things not commonly experienced (in a training set), and so on. Sometimes this is exactly the point - which is one of the reasons it is an issue. Simplifying a document into a class can be the entire use of a system, but often enough there are decisions that will limit your system's expressiveness, which you should be aware of when you use it.
Experience
Central questions to ask are what experience actually consists of, how it should be represented, what parts of it are useful features, what should be learned, and even whether the data will conform to the restrictions of the algorithm(s) you'll use - in classification, for example, you need to know whether something only fits inside a single class, how noise-robust the algorithm is, and how noisy the data is.
Models will usually allow you to embed or seed with prior knowledge as well as learn from experience data. Sometimes that distinction isn't even that large.
Lightly sidestepping into AI for a second, you could define a task of playing checkers:
- T: Playing checkers
- P: Games won (this is a fairly general, long-term and loose goal, yes)
- E: Games played against itself
An interesting point here is that training on example games wouldn't necessarily be such an interesting option, because that would likely focus learning on roughly copying legal moves, which is unlikely to be exhaustive or even tactical in terms of a few or even a single move - depending, of course, on how much prior data and domain theory you introduce.
Target function
Mathematically, you can see many algorithms-for-problems as a combination of:
- Input data (the unseen data)
- A target function (trained with experience)
- A desired output mapped to by that target function
In slightly more general terms, you can be said to be looking for a useful hypothesis (function) among many possible ones. Sometimes the hypothesis space (the set of possible ones) is finite and even computationally tractable, in which case you can calculate the best one. Often, however, you can't, and you will have to approximate the ideal target function.
The type of value that the target function produces is representative of what sort of algorithm you are dealing with. Discrete values will often be used to classify (and boolean often to classify into relevant or not, in eg. concept learning), whereas real values usually point to correlation and regression tasks.
Your definition of the target function introduces restriction bias, in that your output will have a limited range, and therefore be limited to specific values. More exactly, you will (have to) internally choose a specific representation for the data.
Think of ways you would choose a target function for playing a move in chess. There are various ways to define this target function and what you would actually value when choosing the best move. Consider, for example, that you can map a board position to a best move, or to whichever of the possible next board positions we like best. They sound the same, but will probably lead to different focus, and different direct evaluation. In fact, direct evaluation is an integral part of the function: what do you actually prefer? Would you just value having more pieces? That and threatened pieces? That and the amount you threaten? All of that for all the types of your pieces? Doing just that will focus your tactics entirely on those aspects and nothing else -- basically because that will be the universe in terms of evaluation.
To come back to learning: choosing a nonoptimal representation isn't actually fatal. For example, in the chess-playing setup just suggested, what you would probably actually decide to learn is how important each of those aspects is. In the long term your algorithm would probably learn that pawns can be quite useful, queens perhaps not as much as the prior knowledge you put in there may suggest, and it may even discover that some evaluation aspects you introduced are almost entirely irrelevant.
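As a rough illustration of that idea -- learning how much each evaluation aspect matters -- here is a minimal sketch of a linear evaluation function with a least-mean-squares style weight update. The feature choice, example values, and learning rate are made-up assumptions for illustration, not a working game evaluator.

 # Minimal sketch of learning feature weights for a board-evaluation target function.
 # The features (piece counts, threats) and the learning rate are illustrative assumptions:
 # V(board) = w0 + sum_i w_i * x_i(board)
 
 def evaluate(weights, features):
     """Linear evaluation: bias term plus weighted feature values."""
     return weights[0] + sum(w * x for w, x in zip(weights[1:], features))
 
 def lms_update(weights, features, training_value, learning_rate=0.01):
     """Least-mean-squares style update: nudge weights toward the training value."""
     error = training_value - evaluate(weights, features)
     weights[0] += learning_rate * error
     for i, x in enumerate(features):
         weights[i + 1] += learning_rate * error * x
     return weights
 
 # Hypothetical features: [own pieces, opponent pieces, own pieces threatened]
 weights = [0.0, 0.0, 0.0, 0.0]
 position_features = [12, 11, 2]
 weights = lms_update(weights, position_features, training_value=1.0)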
Experience
To learn from experience without supervision, things like iterative methods are used, which can be seen as having at least the elements of experience and of measuring performance.
That's a couple of new terms. The amount of supervision (supervised, unsupervised, semi-supervised) refers to how much human interaction is needed to make the system work at all, though of course interactive supervision will often lead to tweaking as well.
Iterative learning, or in more AI-like terms, reinforcement learning, is simply a step towards the unsupervised. It learns particularly what interactions are effective. Note that most basic ML isn't directly that advanced.
Fuzzy logic
Fuzzy logic is itself fuzzily defined.
It is mostly a formalization of working with degrees of applicability, or rather, degree of membership to a group, according to fuzzy set theory[1].
Superficially, partial membership is the same as probability, but there are differences you can care about.
Fuzzy logic is often contrasted with boolean logic, which in itself can be a misleading way to introduce it.
Fuzzy-logic-based decisions / control can often just as easily be described as conditional logic with a few degrees and thresholds (seemingly particularly so for the more basic, linear fuzzy set membership functions), which in themselves are nothing special, particularly in programming, which is where most fuzzy logic is applied. That said, fuzzy logic can be used to keep such a system simple, or to keep it more numeric, abstract, and/or apart from the main code.
http://en.wikipedia.org/wiki/Fuzzy_logic http://en.wikipedia.org/wiki/Fuzzy_set
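To make "degree of membership" a little more concrete, here is a minimal sketch assuming a triangular membership function over temperature; the set names and breakpoints are invented for illustration.

 # Minimal sketch of fuzzy set membership using a triangular membership function.
 # The fuzzy sets ('warm', 'hot') and their breakpoints are made up for illustration.
 
 def triangular(x, left, peak, right):
     """Degree of membership in [0, 1] for a triangular fuzzy set."""
     if x <= left or x >= right:
         return 0.0
     if x <= peak:
         return (x - left) / (peak - left)
     return (right - x) / (right - peak)
 
 temperature = 22.0
 warm = triangular(temperature, 15, 25, 35)   # 0.7: fairly warm
 hot = triangular(temperature, 25, 35, 45)    # 0.0: not hot at all
 print(warm, hot)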
Methods / algorithms / searchers
There are a lot of algorithms, each of which can usually be said to model data in specific ways. Usually these model specific types of structure well and others not so well.
Decision trees
Decision trees are used to classify: to decide.
They are, in some ways, no more than a codification of already-known rules.
In themselves they are memorization, not learning.
You can create such decision trees, but this requires fairly clean data.
Decision trees are essentially acyclic graphs where each edge/transition is the answer to the node it originates in.
For example, the root node could ask "Weather", from which "Overcast" leads to "Sad", and "Sunny" leads to "Happy."
Of course, such decision nodes can follow each other, which also puts them in the context of the information given by the decisions of the previous nodes. As such, a set of given attributes is a walk in the tree.
To be exact:
- Node: attribute
- Branch (edge): attribute value (discrete)
- leaf: classification
Such a tree can contain all training data in a dense tree, but that isn't very useful. Interestingly, on noise-free data it can do so fairly compactly. The value of this is the structure and the possible transformations on it. Pruning branches and making a decision is analogous to making a generalisation: for example, when all leaves after 'Sunny' lead to 'Happy', you can replace that subtree with a single leaf 'Happy' and have exactly the same data encoded.
Similarly, when you notice that most of them do, you can prune the tree and do the same simplification at the cost of only a little accuracy on unseen data - or such is the assumption when you assume training data is representative.
It is also interesting to make the most important/clearest decisions first. Among other things, it ensures pruning lower nodes will have minimal effect.
The usefulness of a decision can be measured by information gain, basically a measure of how much knowing an attribute reduces entropy.
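As a concrete sketch of that measure, the following computes entropy and information gain for discrete attributes. The tiny example data and the (attribute dict, label) representation are assumptions made for illustration.

 # Minimal sketch of entropy and information gain for discrete attributes,
 # assuming examples are (attribute_dict, label) pairs; names are illustrative.
 from collections import Counter
 from math import log2
 
 def entropy(labels):
     counts = Counter(labels)
     total = len(labels)
     return -sum((c / total) * log2(c / total) for c in counts.values())
 
 def information_gain(examples, attribute):
     """Entropy of the whole set minus the weighted entropy of each attribute-value subset."""
     labels = [label for _, label in examples]
     base = entropy(labels)
     remainder = 0.0
     values = set(features[attribute] for features, _ in examples)
     for v in values:
         subset = [label for features, label in examples if features[attribute] == v]
         remainder += (len(subset) / len(examples)) * entropy(subset)
     return base - remainder
 
 data = [({"weather": "sunny"}, "happy"),
         ({"weather": "overcast"}, "sad"),
         ({"weather": "sunny"}, "happy")]
 print(information_gain(data, "weather"))  # about 0.918 here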
ID3
In fact, the ID3 algorithm does just what is described above. It constructs a decision tree from scratch by greedily choosing, at each node, the attribute that adds the most value / information gain to the tree.
More exactly, it measures the entropy of each attribute value and weighs it by the fraction of cases it represents (verify). The information gain is the current node's entropy minus all such values. It comes down to more distinctive decisions being chosen first and less distinctive ones later, meaning that the important decisions come first.
It also means this is purely local optimality which ignores depthwise considerations, meaning it is not guaranteed to be optimal, only a fair guess.
ID3 searches (and can encode) the entire hypothesis space. It is also statistically based, so it can handle training data that is a little noisy: the information gain will be lower, but that will not necessarily disturb the process or choice.
Now note that instead of restriction bias (limiting the range of outcomes or attributes) we have preference bias: we decided we prefer 'useful' decisions based on some measure.
However, it is convenient. It is in the spirit of Occam's razor in that it tries to explain everything it sees with the shortest hypothesis. If such a short one exists, it is more likely to be structure in the data than coincidence.
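As a rough sketch of that greedy construction (a simplification for illustration, not the exact ID3 as published), building on the information_gain helper from the earlier snippet; the stopping rules and tie-breaking are simplified assumptions.

 # Rough recursive sketch of ID3-style tree building, reusing entropy/information_gain
 # from the earlier snippet. Stopping rules and tie-breaking are simplified assumptions.
 from collections import Counter
 
 def build_tree(examples, attributes):
     labels = [label for _, label in examples]
     # All examples agree, or nothing left to split on: return a leaf (majority label).
     if len(set(labels)) == 1 or not attributes:
         return Counter(labels).most_common(1)[0][0]
     # Greedy choice: the attribute with the highest information gain here.
     best = max(attributes, key=lambda a: information_gain(examples, a))
     tree = {"attribute": best, "branches": {}}
     for value in set(features[best] for features, _ in examples):
         subset = [(f, l) for f, l in examples if f[best] == value]
         remaining = [a for a in attributes if a != best]
         tree["branches"][value] = build_tree(subset, remaining)
     return tree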
In the context of a decision tree, overfitting often comes in the form of adding too many levels of deep tree detail - the tree becomes rather specific to the training data, probably to the point that it has no generalizing power and does not deal well with unseen data, or the entire general universe of possible data (not that that is measurable).
Pruning (ID3, others)
A decision tree that encodes all of the training data is quite likely to be overfitting, so you probably want to generalize by pruning.
You could in theory do this during construction, e.g. decide that if your best decision isn't statistically very strong, you do not want to add a branch.
More generally, you could post-prune trees. When you have less information about the construction of the tree, you can prune by a sort of trial and error based on a held-out set of data, for example by seeing which tree minimizes (tree size + misclassifications).
Pruning will often mean an accuracy increase because it will work for more general cases, but only a little because it will exclude other cases in the process.
Tree effects/extensions
- You can use continuous values in a limited way, by using a threshold to map them into a boolean or a set of discrete values.
- Attributes with many values will tend to look better by information gain than they really are.
- There are alternatives to simple information gain, such as gain ratio.
- You can deal with missing values in various ways.
Rule Post-pruning; C4.5
C4.5 is an algorithm used to generate a decision tree
One thing you can do with a tree is turn it into equivalent rules - each path, after all, is a conjunction of conditions, equivalent to an if-then structure.
You can then prune conditions in each rule's expression that do not seem to matter, based on something such as information gain, and sort the rules for application.
Apparently this is the C4.5 method. I'll have to look up the details.
http://en.wikipedia.org/wiki/C4.5_algorithm
https://octaviansima.wordpress.com/2011/03/25/decision-trees-c4-5/
J48 is one specific implementation (the one in Weka)
See also
Instance-based learning
"Just Store all data in memory" and find likely answer by analogy:
- Define a distance measure between two items
- Use the distance measure between the unseen item and every training data member
- Calculate the answer from the most likely entry's (or entries') target value
The nearest neighbour(s) based on euclidean distance in n-dimensional space -- when instances can be mapped to such a space, and only when this is sensible (as dimensions also interact).
Lazy method (decision on the spot), vs. eager (generalize/learn before using).
k-Nearest Neighbor (k-NN)
Use the k nearest neighbours, or just one.
Weighted variant: weigh closer neighbours more heavily - possibly even using all points, weighted by distance.
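A minimal sketch of the distance-weighted variant, assuming points in n-dimensional space with euclidean distance; the data, the value of k, and the inverse-distance weighting are illustrative choices.

 # Minimal sketch of (distance-weighted) k-NN classification in n-dimensional space.
 # The data, k, and the inverse-distance weighting are illustrative assumptions.
 from collections import defaultdict
 from math import dist  # euclidean distance (Python 3.8+)
 
 def knn_classify(train, query, k=3):
     """train: list of (point, label); returns the label with the highest summed weight."""
     neighbours = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
     votes = defaultdict(float)
     for point, label in neighbours:
         votes[label] += 1.0 / (dist(point, query) + 1e-9)  # closer neighbours count more
     return max(votes, key=votes.get)
 
 train = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 5.1), "b"), ((4.8, 5.3), "b")]
 print(knn_classify(train, (1.1, 1.0), k=3))  # likely "a"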
Semi-sorted
Markov Models, Hidden Markov Models
Something like (the simplest possible) Bayesian Belief Networks, but geared to streams of data. Can be seen as a state machine noting the likeliness of each next step based on a number of preceding steps.
The hidden variant only shows its output (and hides the model that produces it), the non-hidden one shows all of its state.
Simple ones are first-order
Expectation-maximization (EM) is regularly used. (EM: an iterative process based on maximum likelihood estimation in the presence of unobserved (hidden) values)
Classification
Naive Bayes
Some background
A certain Reverend Thomas Bayes said a number of things about probability, used mostly in Bayesian inference in statistics, and in Naive Bayes classification.
Classification, in general, is usually based on a trained model (or sometimes hardcoded assumptions):
- deciding on a set of distinct target classes (say, exclusive ones such as 'spam' vs. 'not spam', or rough article subjects; preferably exclusive, or classification itself does not make that much sense, although the value-for-class assignments still may)
- deciding some features to look at, that classification will be based on
- learning the importance of features to each class (often using good examples for each class)
Given something unseen to classify, you extract its features the same way you did for the training set, and see which class compares best. Classification is regularly many fuzzy measures of fit followed by a hard choice of which seems best.
Naive Bayes refers to one of the simpler algorithms that does classification this way.
Its definition has a few notational conventions:
- Each training class is referred to as c in the set of classes C (c1...ci)
- We eventually want to judge an unseen document D
- and what we want for D is the class for which the probability that the document fits that class is highest
You can express such Bayesian classification as:
the c for which P(c|D) is highest
or, a bit more formally yet,
c = maxarg( P(ci|D) )
Bayes's rule
The P(c|D) above is not obvious to calculate, because it immediately asks us for a best choice for a document, and a thorough answer of that will probably depend on multiple pieces of information: document's features, all classes, and all documents.
So we need to break the problem down before we can give an estimation of the probability P(c|D),
and the first step is to apply Bayes's rule (which pops up in various other types of probability calculations and estimations).
P(c|D) = ( P(D|c) * P(c) ) / P(D)
...for each class c in C.
Looking at the parts:
P(c) is the base probability of each class, that is, the probability that a document is in class c given no other information.
- This is sometimes assumed to be equal for all classes (in which case it falls away),
- sometimes it is estimated, for example using the assumption that the amount of training examples in each class is a direct indication of how commonly an item falls into that class.
P(D) is the chance of finding the document, in general.
- This is usually assumed to be independent of the classification problem.
- because of that, and because it is hard to realistically estimate for an unseen document, it is often assumed to be equal for all documents (so falls away).
...so P(D|c) is the interesting one. It asks for the probability of seeing a particular document given a particular class.
- This is still abstract, and not yet a directly calculable answer, but it is a smaller problem than we had a few paragraphs earlier: unlike the earlier form (P(c|D)), which deals with an immediate choice between alternatives, we now deal mainly/only with a calculation/estimation of fit, of 'how well does a specific document fit in each specific class?'.
- This degree of fit for a document in a class can be done with any method that judges similarity, as long as
- it is (fairly) stable/consistent between documents, and
- it can be calculated (roughly) equally well for seen and unseen documents.
Long story somewhat shorter, applying Bayes' rule reduces the classification problem from
- "Choose best class immediately"
to
- "find the class for which our estimation of fit is highest" (by making an estimator that can judge individual documents, and can do so stably / comparably well for all documents)
Put another way, from:
maxarg( P(ci|D) )
to:
maxarg( P(ci) * P(D|ci) )
or, assuming equal base-class probabilities, to:
maxarg( P(D|ci) )
Features, Words, and Naivety
So let's talk about an implementation.
If we want to calculate a class's fit based on features, we need to choose those features, and a method to calculate an arbitrary document's fit to those features.
It would be nice if each feature is
- fairly robust in that it does not react in an unnecessarily noisy way, and
- productive for the classification task at hand.
In Naive Bayes introductions, the choice of features is often words, and the probabilities are often based on the relative occurrence of each chosen word.
Words are maybe not the best possible choice, but are great for introductions, simple enough in that we have everything we need already and we aren't adding anything complex. (it also speaks to the imagination for things like spam, even if spammers know and try to defeat this model).
If you're programming such a toy example, then the amount of words is often reduced to maybe a few thousand. The actual selection is interesting, in that they ought to be distinguishing within the training set ("the" may be common but not useful), but not be so unusual that they are rare within unseen documents.
There are cleverer ways of choosing words, and you probably want to consider the Zipf distribution of words, but to keep the example simple, let's say we take the most common 1000 in the training set, minus some obvious stopwords.
Naive Bayes is often explained by saying we want a probability based on n features in feature set F: P(c|f1..fn), which Bayes-inverses and denominator-eliminates to:
P(c) * P(f1..fn|c)
...which roughly means "how do features in this new document compare to the one for the class?"
Technically, P(f1..fn|c) should expand to a very long formula, as each feature depends on others. That's where the naivety of Naive Bayes comes in: the Naive Bayes assumption is simply that all features are independent of all others. This is rarely true, but is much simpler to work with, is evaluated much faster, and works better than you might expect. (this also effectively makes it a bag-of-words model)
It means the calculation of the probability of the class based on the features simplifies to a simple multiplication of individual feature values / probabilities. Given there are n features:
P(c) * Π_{i=1..n} P(f_i|c)
Note that the words-as-features setup is a bag-of-words assumption: it, plus the naivety, means that word order is ignored. This means that potentially indicative phrases, collocations, and such are represented only weakly at best, by their words' independent probabilities being slightly higher. Word n-grams are one possible fix, but will likely run you into a sparseness problem, since n-grams are implicitly rarer than single words.
Even just bi-grams may be problematic - if you take all possible ones from a large set of documents, you'll find that most documents have a tiny fraction of the overall real use.
Bayesian classification as a procedure...
Naive Bayes training comes down to:
- Deciding which set of features (F) you will use.
- Train, which calculates characteristics for each class, namely:
- P(c) for each class (see above), and more importantly,
- all P(f|c): the probability of each feature f in each class c
Later, when classifying an unseen document, you calculate P(c) * P(f1..fn|c) for it:
- for each c in C:
- take P(c), times
- the calculated probability of each feature in class c, i.e. each P(f|c) based on the document.
- Choose the one with the highest probability, and you've got a classifier.
Choosing your features, and further discussion
Naive Bayes with simple word features works fairly well on various types of text, since it essentially notices differences in vocabulary use, the theory being that that is indicative of the document class. Common words can be useful when they aren't uniformly common between all classes, uncommon ones since they tend to up the probability of just a few classes.
You could simply use all words in all documents and end up with tens of thousands of features, although it would help speed to prune that a little. Which ones to leave is a choice you can be halfway smart about - although note that too much tweaking just means it will work better on the training data, with no guarantees for unseen data.
Also note that when you pre-process the training data, you need to process the unseen documents the same way. When choosing the features, you should consider what this may do to unseen documents.
A feature probability of zero is evil to calculations, as it randomly destroys comparability in both the training and classification steps. It can happen fairly easily, such as when a word in the vocabulary doesn't occur in a document. A common solution is to cheat and pretend you did actually see it a little. There are a few different ways; some disturb the relative probabilities less than others.
You can add other features. The main property they should have is that their probabilities compare well, are in the same sort of scale. You could for example add important words a few times, a hackish way of weighing them more. You could for example add select word bigrams or add character n-grams.
You can alter the model further, observing that what you actually do is per-class strengthening or weakening of a probability that it will win. It doesn't even have to be a simple probability; you could for example include sentence length, which is not a direct probability, so you would have to e.g. remember the average sentence length in each class, calculate the average for the unseen document, and define some metric of similarity. This isn't Naive Bayes anymore, but as a probability-based classifier it may work a little better if you choose your alterations well.
Pseudocode
This example pseudocode uses only word features, and some basic add-one(verify) smoothing to avoid the zero problem.
Training (word features)
 U = feature universe
 D = document set (text, class pairs)
 C = set of classes in D
 
 foreach c in C:
     BaseProb(c) = |D with class c| / |D|
     Text = concatenation of documents in current class
     N = |tokens in Text|
     foreach token t in U:
         P(t|c) = ( numoccurence(t in Text) + 1 ) / ( numtokens(Text) + |U| )
Classifying (word features)
 foreach c in C:
     p = P(c)
     foreach token in Document:
         p = p * P(token|c)
...then collect each of these probabilities and choose the class with the highest probability.
In reality, the probabilities quickly become very small and you run into the problem that standard 32-bit or 64-bit floating point numbers cannot accurately contain them.
A common workaround for this is to work with log probabilities. These values are much less likely to scale out of control, and we mostly care about their monotonicity.
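As a rough, self-contained illustration of the above (word features, add-one smoothing, log probabilities), here is a sketch under those assumptions -- the tiny training set and whitespace tokenization are stand-ins, not a reference implementation.

 # Rough sketch of Naive Bayes with word features, add-one smoothing, and log probabilities.
 # The tiny training set and the whitespace tokenization are illustrative assumptions.
 from collections import Counter, defaultdict
 from math import log
 
 def train(documents):
     """documents: list of (text, class). Returns (log priors, log likelihoods, vocabulary)."""
     vocab = set(t for text, _ in documents for t in text.split())
     class_docs = defaultdict(list)
     for text, c in documents:
         class_docs[c].append(text)
     log_prior, log_likelihood = {}, {}
     for c, texts in class_docs.items():
         log_prior[c] = log(len(texts) / len(documents))
         counts = Counter(t for text in texts for t in text.split())
         total = sum(counts.values())
         # add-one smoothing: every vocabulary word gets a nonzero probability
         log_likelihood[c] = {t: log((counts[t] + 1) / (total + len(vocab))) for t in vocab}
     return log_prior, log_likelihood, vocab
 
 def classify(text, log_prior, log_likelihood, vocab):
     scores = {}
     for c in log_prior:
         scores[c] = log_prior[c] + sum(log_likelihood[c][t] for t in text.split() if t in vocab)
     return max(scores, key=scores.get)
 
 docs = [("cheap pills buy now", "spam"), ("meeting notes attached", "ham"),
         ("buy cheap now", "spam"), ("project meeting tomorrow", "ham")]
 model = train(docs)
 print(classify("buy pills now", *model))  # likely "spam"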
Semi-sorted
- Parzen classifier
- Backpropagation classifier
Evaluating classifiers
Precision/recall, accuracy, and variant/hybrid measures (some valuing precision over recall, or vice versa).
You shouldn't test on the training data, because even if it completely failed to generalize, it might still give the right output simply because it has seen that precise example before.
So once your set of annotated data is large enough, you can start splitting some of it off into an evaluation/testing set.
Instead of using that for training, you pretend it's unseen data -- that you happen to know the correct answer for
When you have enough data, you could split into training, evaluation, and a completely separate test set that you use for nothing else and treat as your unseen data. Usually this will contain some bias, namely that it comes from the same data collection, but it is a pretty good test of how a system tweaked to training data works on data it hasn't seen before, and should reveal if it is inflexible.
When data is scarce, you can do an n-fold crossvalidation.
This basically just means you do the above with many different training/evaluation splits, and average the accuracy.
It makes it less likely that your test will accidentally be skewed by a few odd cases.
Though it needs to be said that if it's not too hard to annotate more data, that may be worth more.
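A minimal sketch of n-fold cross-validation; the classifier interface (train_fn/predict_fn functions) is a placeholder assumption, not a specific library API.

 # Minimal sketch of n-fold cross-validation; train_fn and predict_fn are placeholder
 # assumptions for whatever classifier you are evaluating.
 import random
 
 def cross_validate(data, n_folds, train_fn, predict_fn):
     """data: list of (features, label). Returns mean accuracy over the folds."""
     data = data[:]
     random.shuffle(data)
     folds = [data[i::n_folds] for i in range(n_folds)]
     accuracies = []
     for i in range(n_folds):
         test = folds[i]
         train = [item for j, fold in enumerate(folds) if j != i for item in fold]
         model = train_fn(train)
         correct = sum(1 for features, label in test if predict_fn(model, features) == label)
         accuracies.append(correct / len(test))
     return sum(accuracies) / n_folds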
Combining classifiers
Ensemble learning
Support Vector Machines
Support-vector regression
Using the kernel method
Clustering
Intro
Clustering takes input and assigns a label from a set of discrete labels.
Some variants assign multiple labels, often with relative scores.
Clustering can be split into a few related sub-problems, including
- cluster formation - organizing elements into groups
- cluster segmentation - dealing with boundaries between groups
- labeling - assigning meaningful names
- for the (relatively few) cases where this makes sense to estimate
- deciding how many groups to have in the result
- evaluation of a solution (possibly feeding back into the previous point)
Clustering methods vary in their intermediate definitions, e.g.
- method of calculating the similarity/distance (the metric)
- intermediate definitions, such as agglomerate distances in hierarchical clustering
- thresholding or other means of inclusion/exclusion decision per cluster, possibly a target amount of clusters.
Some of the simplest clustering might be more formally described as:
- you have a set of n data objects, call it D = { d1, ..., dn }
- in its simplest shape, a clustering result is a disjoint partitioning of D
- which makes clustering itself a function, mapping each datapoint to a cluster number/label that indicates membership of said cluster
The input data is often either
- a set of a points in a many-dimensional space, usually a vector space, plus a metric to calculate distance between them, OR
- a set of already-chosen distances
- preferably complete, but depending on how it's made it might be sparse, and it may be easier to do some fuzzy statistical estimation than to ask people to complete it
The latter may well be in the form of a distance matrix / similarity matrix.
A bunch of methods, given datapoints plus a metric, convert to such a matrix internally, but starting with data plus a metric is often a little more flexible up front - both for data massaging, and sometimes for implementation reasons.
That said, the choice of metric takes care,
because there are many ways to accidentally put some bias into the metric.
Many methods look at element-to-element similarities/dissimilarities, while a few choose to be more involved with the data those similarities come from (e.g. some maximum-likelihood-based methods common in bioinformatics).
Variations
Hard, soft, and fuzzy clustering
Hard clustering means each item should be assigned to a single group. This is essentially a partitioning.
Soft clustering means something can belong to more than one group.
Regularly used for data known to be too complex to be reduced cleanly with hard clustering, such as when there are closeby, overlapping, or ambiguous groups.
Soft clustering is generally understood as boolean soft clustering: something can belong to one or more clusters, but there are no degrees.
Fuzzy clustering is soft clustering plus degrees of membership.
This means intermediate results are effectively still moderately high-dimensional data, so you often still have to make a decision about exclusion, thresholds, or such (preferably within the algorithm, to have all information available).
If you don't make such a decision, the result more resembles dimensionality reduction.
Agglomerative versus divisive
Agglomerative clustering usually starts with each item in its own cluster and merges them where it seems a good idea.
Divisive clustering usually starts with every item in a single cluster and iteratively splits them as it sees fit, stopping either when some goal is satisfied, or sometimes until everything is split.
Differences include:
- what side they err on in unclear cases.(verify)
- divisive by nature makes its decision on more data, so can be more accurate
- e.g. early combination decisions in agglomerative may be less sensible but cannot be undone without making the method more complex
- divisive is often more efficient to calculate
- divisive is a little more supervised, in that you may need to explicitly give it a way to split that makes sense for the data
In general, divisive clustering seems better at finding large clusters,
while agglomerative clustering is good at finding small clusters.(verify)
Hierarchical clustering
Hierarchical clustering creates a tree of relations, often by a process where we keep track of how things join, rather than just assimilating things into a larger blob.
Hierarchical clusterers can be flexible, in that their results can partition into an arbitrary number of groups (by choosing the depth at which there are that amount of groups).
Depending on the data, these results may also be useful as an approximation for fuzzy clustering. They may also be a little more helpful in cluster stability tests.
Some algorithms record and retain the distances of (/stress at) each such joint.
These can be interesting to visualize (think dendrograms and such),
and to effectively allow the amount-of-cluster choice to be made later (think threshold in a dendrogram).
http://gepas.bioinfo.cipf.es/cgi-bin/docs/clusterhelp
Notes on....
Group number choice
Some algorithms try to decide on a suitable number of target groups, but many require you to choose an exact number before they get started.
This number is difficult to decide, since there usually is no well-defined, implicit, calculable best choice.
This is a problem particularly in hard clustering, because any decision of group membership is very final.
The membership of bordercases may not be stable under even the slightest amount of (sample) noise.
Things you can do include:
- use evaluation to measure the fitness of a solution (or sub-solutions while still clustering), based on some criterion.
Note that such a metric in itself is only a relative value in a distribution you don't know - you'll often have to calculate the fitness for many solutions to get a still-vague idea of fitness.
- In the case of hard clustering, you can intentionally add some noise and see how much the membership of each item varies - and, say, report that as the confidence we have in a choice.
- use some type of cross-validation
Inter-cluster and intra-cluster comparisons; susceptibilities
Depending on the algorithm, you often want to be able to compare
- items to clusters (agglomerative and divisive decisions)
- clusters to clusters (e.g. in hierarchical decisions)
- items to items (e.g. for centroid/medoid decisions)
Cluster-to-cluster distances are most interesting to hierarchical clustering, and can be calculated in a number of ways (usually hardwired into the algorithm), including the following (a small code sketch of a few of these follows the list):
- Single-link, a.k.a. minimum linkage or nearest linkage
- the most similar combination (lowest distance) of possible comparisons
- more susceptible to over-chaining than most other methods
- Complete-link , a.k.a. max/farthest linkage
- uses the least similar (largest distance) combination of possible comparisons
- often gives non-chained, more equally divided clusters than single-link
- outliers may have disproportional influence
- Average-link (a.k.a. group-average, a.k.a. Group-average (agglomerative) clustering (GAAC)) - average of distances between all inter-cluster pairs
- less sensitive to outliers than complete-link, less sensitive to inversion than centroid approaches.
- Centroid approaches use a calculated average for comparison to a cluster
- quite susceptible to inversions (verify)
- Medoid approaches try to use a representative item for comparison to a cluster
- somewhat susceptible to inversions (verify)
- Density-based methods care more about local density of items and less directly about the exact distances involved
- Ward's method
- based on Ward criterion, a.k.a. the Ward minimum variance criterion
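As a small sketch of the first three of these linkages, assuming clusters are simply lists of points and the point-to-point metric is euclidean distance (both assumptions for illustration):

 # Minimal sketch of single-, complete-, and average-linkage cluster-to-cluster distances,
 # given a point-to-point metric; clusters are plain lists of points here (an assumption).
 from math import dist
 from itertools import product
 
 def single_link(a, b):
     return min(dist(x, y) for x, y in product(a, b))
 
 def complete_link(a, b):
     return max(dist(x, y) for x, y in product(a, b))
 
 def average_link(a, b):
     return sum(dist(x, y) for x, y in product(a, b)) / (len(a) * len(b))
 
 a = [(0.0, 0.0), (1.0, 0.0)]
 b = [(3.0, 0.0), (4.0, 0.0)]
 print(single_link(a, b), complete_link(a, b), average_link(a, b))  # 2.0 4.0 3.0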
Potential problems:
- outliers
- including a single outlier may drastically change comparisons to that group.
- The timing of their inclusion can have significant effects on the result.
- the chaining effect refers to algorithms doing a chain/string of assignments to a group. The concept is clearest in single-link clustering.
- Inversions (sometimes 'reversal' or 'non-monotonicity') - describes when similarity values do not decrease monotonously in a series of iterations
- easily happens when a process makes decisions based on centers that move in the process of clustering (such as in many centroid-style processes) particularly when combined with cases where there is no clear clustering solution.
See also:
TODO: read
- http://people.revoledu.com/kardi/tutorial/Similarity/index.html
- http://www.google.com/search?hl=en&q=Mahalanobis+distance
on convergence
Convergence is a nontrivial check in many algorithms.
You could check whether the group assignments have not changed, but this is sensitive to oscillations, resulting in a premature report of convergence and/or a failure to converge (depending somewhat on logic and data size).
A simple threshold is arbitrary since the error values often depend on the scale of the data values (which is not very trivial to correct for(verify)).
This is occasionally solved by error minimization criteria, for example minimization-of-the-sum-of-squares.
There are some details to this. For example, when reallocating a point between clusters, various methods consider only the error decrease in the target cluster - while the solution's total error may increase. It usually still converges, but the total error decreases with a little more oscillation, which a "no significant improvement in the last step" terminating criterion may be sensitive to (though arguably it's always more robust to check whether the error decrease is roughly asymptotic to the minimum you presume it will reach).
The idea resembles Expectation Maximization (EM) methods in that it tries to maximize the probability of the clusters being correct by minimizing the energy/error.
Purely random initial positioning may cause the local minimum problem. Smartly seeding the initial centroids helps and need not be too computationally expensive - and in fact helps convergence; see e.g. k-means++.
Alternatively, you could run many versions of the analysis, each with random initial placement, and see see whether (and/or to which degree) the results are stable, but this can be computationally expensive.
Robustness in hard clustering
Hard clusterers in particular are often not robust against even minor variations in the data. That is, separate data sets that are highly correlated may lead to significantly different results; areas in which membership is borderline flip-flop under the tiniest (sample) noise.
You can evaluate a solution for stress (or correlate the distances it implies to the original data, e.g. in hierarchical clustering), though in itself this is only a relative figure. It is mainly meaningful next to the same measure from other clusterings of the same data, meaning that you *can* roughly compare different solutions for how well they express the original data, but only in a roughly converging way.
One trick is to cause the problem and test how varied results will be over mild variations over the data. You can for example repeat the clustering some amount of times with some noise, and record how often things change membership.
You can repeat the clustering omitting random pieces of data to lessen the effect of outliers - pieces of data that do not agree with the rest.
You can even aggregate the results from these runs and combine them into a sort of fuzzy cluster result that can show you instabilities, and/or converge on a clusters amount choice.
Evaluation
Silhouette coefficients
Davies-Bouldin index
Gap statistic
Related
Clustering as neighbour search
Hamming embedding
Unsorted
Implementation notes
k-means
The k-means problem is finding a cluster labelling for a given number of clusters (k) with minimal error, where the error function is based on the within-group sum of squares.
(For completeness, that means for all elements in a group, calculate the square of the euclidean distance to the centroid, and sum up all these squares, which gives per-cluster error values. Various convergence checks will want to know the sum of these errors)
Most implementations are iterative and look something like the following (a rough code sketch follows the list):
- Position k cluster centroids (at random, or sometimes slightly more cleverly)
- For each element, assign to the nearest centroid
- Recalculate (affected) centroid means (and often the error/energy at the same time)
- Check whether the moved centroids change the element assignments.
- If so, iterate
- If no change, we have converged and can stop
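A rough sketch of those iterative steps, assuming points as tuples, random initialization, and plain euclidean distance, so all the usual caveats about local minima apply; the example data is invented.

 # Rough sketch of the iterative k-means steps above; random initialization and plain
 # euclidean distance, so the usual caveats about local minima apply.
 import random
 from math import dist
 
 def kmeans(points, k, max_iterations=100):
     centroids = random.sample(points, k)
     assignments = [None] * len(points)
     for _ in range(max_iterations):
         # Assign each point to its nearest centroid.
         new_assignments = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
         if new_assignments == assignments:
             break  # no change: converged
         assignments = new_assignments
         # Recalculate each centroid as the mean of its assigned points.
         for c in range(k):
             members = [p for p, a in zip(points, assignments) if a == c]
             if members:
                 centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
     return centroids, assignments
 
 points = [(1.0, 1.0), (1.1, 0.9), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0)]
 print(kmeans(points, k=2))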
Of iterative clustering methods, k-means is the simplest and many others can be said to be based on it.
K-means gives better results if the value for k is a good choice, representative of the data.
(it's not unusual to try various k and test them)
The common 'nearest cluster' criterion will avoid attraction of multiple clusters, and the whole will converge to a decent solution for k groups.
Limitations:
- results are sensitive to initial placement, and it is easy to get stuck in a local minimum.
- It is not unusual to run the clustering various times with different starting clusters and see how stable the clustering is.
- If k is not representative of the structure in the data the solution may not be satisfying at all. This is partly caused by, and partly independent of, the fact that there may be various possible stable clusterings.
- the simple distance metric means the shape for inclusion is effectively always a circle (hypersphere)
Average-case runtime is decent because it's a fairly simple algorithm.
Worst-case runtime, for fairly pathological datasets, is quite high. There are faster approximations of k-means that you may wish to consider.
You can tweak k-means in various ways. For example, you can assign weights (based on frequency, importance, etc.) to elements to affect the centroid mean recalculation.
See also:
- Wikipedia: K-means algorithm
- Arthur & Vassilvitskii, k-means++ The Advantages of Careful Seeding
- SP Lloyd (1957), Least Squares Quantization in PCM
- EW Forgy (1965), Cluster analysis of multivariate data: efficiency vs interpretability of classifications
- MacQueen (1967), Some methods for classification and analysis of multivariate observations.
- Hartigan and Wong (1979), A K-means clustering algorithm
- Kanungo, Mount, Netanyahu, Piatko, Silverman, and Wu, An efficient k-means clustering algorithm: Analysis and implementation
- Bottou, Bengio, Convergence Properties of the K-Means Algorithms
Variations:
- ISODATA (Iterative Self-Organising Data Analysis Technique Algorithm) builds on k-means and tries to be smart in initial positions and in variations of k(verify).
- H-means is a variation on k-means that recalculates the centroid only after a complete iteration over all the items, not after each reassignment.
- Seen one way, it checks for error decrease less often, which makes it a smidge more sensitive to local minima and perhaps doesn't converge as nicely.
- Although the difference in practice tends to be slight, k-means tends to be the slightly safer (and much more common) choice, even if the order it handles elements in is a different kind of bias that h-means avoids.
Canopy clustering
https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Bisecting k-means
hard c-means
UPGMA, WPGMA
UPGMA = Unweighted Pair Group Method using Arithmetic averaging
WPGMA = Weighted Pair Group Method using Arithmetic averaging
(These specific names/abbreviations come from Sneath and Sokal 1973)
UPGMA weighs every leaf equally, which means cluster sizes enter the update:
D((u,v),w) = ( nu*D(u,w) + nv*D(v,w) ) / (nu + nv)
WPGMA weighs the two clusters being combined equally, regardless of size:
D((u,v),w) = ( D(u,w) + D(v,w) ) / 2
In the unweighted variant, every leaf weighs in equally (so a larger sub-cluster pulls the result more);
in the weighted variant, the two parts being combined weigh in equally, however many leaves each contains.
Both are bottom-up combiners working from a distance matrix, combining whichever leaf/cluster pair has minimal distance, then recalculating the distance matrix.
It's not too hard to argue this terminology is a little arbitrary.
See also:
- P H Sneath, R R Sokal (1973) "Numerical Taxonomy: The Principles and Practice of Numerical Classification"
UPGMC, WPGMC
Unweighted Pair Group Method using Centroids, and Weighted Pair Group Method using Centroids
(the specific names/abbreviations come from Sneath and Sokal 1973)
Fuzzy c-means (FCM)
- Type: fuzzy clustering (not really partitioning anymore)
Method: like k-means, but weighs the centroid recalculation by each data point's degree of membership (derived from its fuzzy distance to all centroids)
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/cmeans.html
http://www.scholarpedia.org/article/Fuzzy_C-Means_Cluster_Analysis
Fuzzy k-means
Shell clustering
Basic fuzzy c-means includes points by a radius, so its clusters are effectively spheres.
There are various other options, including:
- fuzzy c-quadric shells algorithm (FCQS) detects ellipsoids
- fuzzy c-varieties algorithm (FCV) detects infinite lines (linear manifolds) in 2D
- adaptive fuzzy c-varieties algorithm (AFC): detects line segments in 2D data
- fuzzy c-shells algorithm (FCS) detects circles
- fuzzy c-spherical shells algorithm (FCSS) detects circles
- fuzzy c-rings algorithm (FCR) detects circles
- fuzzy c-rectangular shells algorithm (FCRS) detects rectangles
- Gath-Geva algorithm (GG) detects ellipsoids
- Gustafson-Kessel algorithm (GK) detects ellipsoids of roughly the same size
PAM
Partitional, medoid-based
CLARA
Partitional, medoid-based
AGNES
Hierarchical
AGglomerative NESting
DIANA
Hierarchical, divisive
DIvisive ANAlysis - the idea is that at every step, the most homogenous cluster is split in two
Buckshot
Hybrid
Birch
Hybrid
BIRCH (balanced iterative reducing and clustering using hierarchies)
https://en.wikipedia.org/wiki/BIRCH
Cure
Hybrid
Clustering Using REpresentatives
https://en.wikipedia.org/wiki/CURE_algorithm
Rock
Hybrid
"Rock: A robust clustering algorithm for categorical attributes"
Chameleon
Hybrid, Hierarchical
"CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling"
DBSCAN
Density-based.
When you can reasonably assume that real objects will always be a dense cloud of points, you can more easily reject random points as noise/outliers (even if relatively close).
It can also deal decently with closeby odd-shaped clusters, if separated cleanly, which distance-based things like k-means will almost always have trouble with.
Has trouble when there are meaningful clusters of varying density.
http://en.wikipedia.org/wiki/DBSCAN
HDBSCAN
Hierarchical DBSCAN.
Better at finding clusters of varying densities, which also ends up doing better at odd shapes?
https://github.com/scikit-learn-contrib/hdbscan
GDBSCAN
Generalized DBSCAN
OPTICS
Similar to DBSCAN; addresses the 'what if there are meaningful clusters of varying density' issue that DBSCAN has.
https://en.wikipedia.org/wiki/OPTICS_algorithm
FLAME
Fuzzy, density-based
http://en.wikipedia.org/wiki/FLAME_Clustering
Clustering by Committee (CBC)
Hybrid
Clustering By Committee (CBC) is an algorithm to extract similar words, based on the intuition that said words occur in the same contexts.
Based on responsive elements (a committee) voting on specific outcomes.
See also:
- http://www.aclweb.org/aclwiki/index.php?title=CBC
- PA Pantel (2003) Clustering by Committee [2]
- P. Pantel and D. Lin (2002) Discovering Word Senses from Text
Principal Direction Divisive Partitioning (PDDP)
Information Bottleneck
See also:
- https://en.wikipedia.org/wiki/Information_bottleneck_method
- N Slonim, "The Information Bottleneck: Theory and Applications"
Agglomerative Information Bottleneck
See also:
- N Slonim, N Tishby (1999) "Agglomerative Information Bottleneck"