Data clustering
This is more for overview (my own) than for teaching or exercise.
For some more indepth material to study, see e.g. the http://www.khanacademy.org/
Other data analysis, data summarization, learning

Contents
 1 Intro
 1.1 Variations
 1.2 Notes on....
 1.3 Implementation notes
 1.3.1 kmeans
 1.3.2 Canopy clustering
 1.3.3 Bisecting kmeans
 1.3.4 hard cmeans
 1.3.5 UPGMA, WPGMA
 1.3.6 UPGMC, WPGMC
 1.3.7 Fuzzy cmeans (FCM)
 1.3.8 Fuzzy kmeans
 1.3.9 Shell clustering
 1.3.10 PAM
 1.3.11 CLARA
 1.3.12 AGNES
 1.3.13 DIANA
 1.3.14 Buckshot
 1.3.15 Birch
 1.3.16 Cure
 1.3.17 Rock
 1.3.18 Chameleon
 1.3.19 DBSCAN
 1.3.20 OPTICS
 1.3.21 FLAME
 1.3.22 Clustering by Committee (CBC)
 1.3.23 Principal Direction Divisive Partitioning (PDDP)
 1.3.24 Information Bottleneck
 1.3.25 Agglomerative Information Bottleneck
 1.3.26 Expectation Maximization Clustering
 1.4 Related fields
 1.5 See also
Intro
Clustering groups a few related subproblems, including
 cluster formation  organizing into clusters
 cluster segmentation  dealing with boundaries (often using cluster centers)
 labeling  assigning meaningful names
 for the (relatively few) cases where this makes sense to estimate
 deciding how many groups to have in the result
 evaluation of a solution (possibly feeding back into the previous point)
Formally, the simplest clustering ca be described as:
 you have a set of n data objects, call it D = { d_{1}, ..., d_{n} }
 in its simplest shape, a clustering result is a disjoint partitioning of D
 which makes clustering itself a function, mapping each datapoint to a cluster number/label that indicate membership of said cluster
The input data is often either
 a set of a points in a manydimensional space, usually a vector space, plus a metric to calculate distance between them, OR
 a set of alreadychosen distances
 preferably complete, but depending on how it's made it might be sparse, and it may be easier to do some fuzzy statistical estimation than to ask people to complete it
The latter may well be in the form of a distance matrix / similarity matrix.
A bunch of methods given datapointsplusmetric convert to that internally, but starting dataplusmetric is often a little more flexible up front  both for data massaging, and sometimes for implementation reasons.
That said, the choice of metric takes care,
because there are many ways to accidentally put some bias into the metric.
Many methods look at elementtoelement similarities/dissimilarities, while a few choose to be more involved with the data that comes from (e.g. some Maximumlikelihoodbased methods common in bioinformatics)
Variations
Hard, soft, and fuzzy clustering
Hard clustering means each item should be assigned to a single group. This is essentially a partitioning.
Soft clustering means something can belong to more than one group.
Regularly used for data known to be too complex to be reduced cleanly with hard clustering, such as when there are closeby, overlapping, or ambiguous groups.
Soft clustering is generally understood as boolean soft clustering: something can belong to one or more clusters, but there are no degrees.
Fuzzy clustering is soft clustering plus degrees of membership.
This means intermediate results are effectively still moderately highdimensional data, you often still have to make a decision about exclusion, thresholds or such (preferably within the algorithm, to have all information available).
If you don't make such a decision, the result more resembles dimensionality reduction.
Agglomerative versus divisive
Agglomerative clustering usually starts with each item in its own cluster and merges them where it seems a good idea.
Divisive clustering usually starts with every item in a single cluster and iteratively splits them as it sees fit.
The difference seems to lie largely in what side they err on in unclear cases.(verify)
Hierarchical clustering
Hierarchical clustering creates a tree of relations, often by an process where we keep tracks of how things join, rather than just assimilate things into a larger blob.
Hierarchical clusterers can be flexible, in that their results can partition into an arbitrary number of groups (by choosing the depth at which there are that amount of groups).
Depending on the data these results contain may also be useful as an approximation for fuzzy clustering. They may also be a little more helpful in cluster stability tests.
Some algorithms record and retain he distances of (/stress at) each such joint.
These can be interesting to visualize (think dendrograms and such),
and to effectively allow the amountofcluster choice to be made later (think threshold in a dendrogram).
http://gepas.bioinfo.cipf.es/cgibin/docs/clusterhelp
Notes on....
Group number choice
Some algorithms try to decide on a suitable number of target groups, but many require you to choose an exact number before they get started.
This number is difficult to decide since there is usually is no welldefined, implicit, calculable best choice.
This is a problem particularly in hard clustering, because any decision of group membership is very final.
The membership of bordercases may not be stable under even the slightest amount of (sample) noise.
Things you can do include:
 use evaluation to measure the fitness of a solution (or subsolutions while still clustering), based
Note that such a metric in itself is only a relative value in a distribution you don't know  you'll often have to calculate the fitness for many solutions to get a stillvague idea of fitness.
 In the case of hard clustering, you can intentionally add some noise and see how much the membership of each item varies  and, say, report that as the confidence we have in a choice.
 use some type of crossvalidation
Intercluster and intracluster comparisons; susceptibilities
Depending on the algorithm, you often want to be able to compare
 items to clusters (agglomerative and divisive decisions)
 clusters to clusters (e.g. in hierarchical decisions)
 items to items (e.g. for centroid/medoid decisions)
Clustertocluster distances are most interesting to hierarchical clustering, and can be calculated in a number of ways (usually hardwired into the algorithm), including:
 Singlelink, a.k.a. minimum linkage or nearest linkage
 the most similar combination (lowest distance) of possible comparisons
 more susceptible to overchaining than most other methods
 Completelink , a.k.a. max/farthest linkage
 uses the least similar (largest distance) combination of possible comparisons
 often gives a nonchained, more equally divided clusters than singlelink
 outliers may have disproportional influence
 Averagelink (a.k.a. groupaverage, a.k.a. Groupaverage (agglomerative) clustering (GAAC))  average of distances between all intercluster pairs
 less sensitive to outliers than completelink, less sensitive to inversion than centroid approaches.
 Centroid approaches use a calculated average for comparison to a cluster
 quite susceptible to inversions (verify)
 Medoid approaches try to use a representative item for comparison to a cluster
 somewhat susceptible to inversions (verify)
 Densitybased methods care more about local density of items and less directly about the exact distances involved
 Ward's method
 based on Ward criterion, a.k.a. the Ward minimum variance criterion
Potential problems:
 outliers
 including a single outlier may drastically change comparisons to that group.
 The timing of their inclustion can have significant effects on the result.
 the chaining effect refers to algorithms doing a chain/string of assignments of to a group. The concept is clearest in
 Inversions (sometimes 'reversal' or 'nonmonotonicity')  describes when similarity values do not decrease monotonously in a series of iterations
 easily happens when a process makes decisions based on centers that move in the process of clustering (such as in many centroidstyle processes) particularly when combined with cases where there is no clear clustering solution.
See also:
TODO: read
 http://people.revoledu.com/kardi/tutorial/Similarity/index.html
 http://www.google.com/search?hl=en&q=Mahalanobis+distance
on convergence
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Convergence is a nontrivial check in many algorithms.
You could check whether the group assignments have not changed, but this is sensitive to oscillations, resulting in a premature report of convergence and/or a failure to converge (depending somewhat on logic and data size).
A simple threshold is arbitrary since the error values often depend on the scale of the data values (which is not very trivial to correct for(verify)).
This is occasionally solved by error minimization criteria, for example minimizationofthesumofsquares.
There are some details to this. For example, reallocating a point between clusters, various methods consider only the error decrease in the target cluster  while the solution's total error may increase. It usually still converges, but the total error decreases with a little more oscillation, which "no significant improvement in the last step" terminating criterion may be sensitive to (though arguably it's always more robust to check whether the error decrease is roughly asymptotic with the minimum you presume it'll get to).
The idea resembles Expectation Maximization (EM) methods in that it tries to maximize the probability of the clusters being the correct by minimizing the energy/error.
Purely random initial positioning may cause the local minimum problem. Smartly seeding the initial centroids helps and need not be too computationally expensive  and in fact helps convergence; see e.g. kmeans++.
Alternatively, you could run many versions of the analysis, each with random initial placement, and see see whether (and/or to which degree) the results are stable, but this can be computationally expensive.
Robustness in hard clustering
Particularly hard clusterers are often not robust against even minor variations in the data. That is, separate data sets that are highly correlative may lead to significantly different results; areas in which membership is borderline flipflip under the tiniest (sample) noise.
You can evaluating a solution for stress (or correlate distances it implies to the original data e.g. in hierarhical data), though in itself this is only a general thing. It is meaningful the same measre from other clusterings of the same data, meaning that you *can* roughly compare different solutions for expression of the original data, but only in a roughly converging way.
One trick is to cause the problem and test how varied results will be over mild variations over the data. You can for example repeat the clustering some amount of times with some noise, and record how often things change membership.
You can repeat the clustering omitting random pieces of data to lessen the effect of outliers  pieces of data that do not agree with the rest.
You can even aggregate the results from these runs and combine them into a sort of fuzzy cluster result that can show you instabilities, and/or converge on a clusters amount choice.
Evaluation
Silhouette coefficients
DaviesBouldin index
Gap statistic
Unsorted
Implementation notes
kmeans
The kmeans problem is finding a cluster labelling for a given amount of clusters (k) with minimal error, where the error function is based on the the withingroup sum of squares.
(For completeness, that means for all elements in a group, calculate the square of the euclidean distance to the centroid, and sum up all these squares, which gives percluster error values. Various convergence checks will want to know the sum of these errors)
Most implementations are iterative and look something like:
 Position k cluster centroids (at random, or sometimes slightly more cleverly)
 For each element, assign to the nearest centroid
 Recalculate (affected) centroid means (and often the error/energy at the same time)
 Check whether the moved centroids change the element assignments.
 If so, iterate
 If no change, we have converged and can stop
Of iterative clustering methods, kmeans is the simplest and many others can be said to be based on it.
Kmeans gives better better results if the value for k is a good choice, representative for the data.
(it's not unusual to try various k and test them)
The common 'nearest cluster' criterion will avoid attraction of multiple clusters, and the whole will converge to a decent solution for k groups.
Limitations:
 results are sensitive to initial placement, and it is easy to get stuck in a local minimum.
 It is not unusual to run the clustering various times with different starting clusters and see how stable the clustering is.
 If k is not representative of the structure in the data the solution may not be satisfying at all. This is partly caused by, and partly independent of, the fact that there may be various possible stable clusterings.
 the simple distance metric means we say the shape for inclusion is always a circle
Averagecase runtime is decent because it's a fairly simple algorithm.
Worstcase runtime, for fairly pathological datasets, is fairly quite high. There are faster approximations of kmeans that you may wish to consider.
You can tweak kmeans in various ways. For example, you can assign weights (based of frequency, importance, etc) to elements to affect the centroid mean recalculation.
See also:
 Wikipedia: Kmeans algorithm
 Arthur & Vassilvitskii, kmeans++ The Advantages of Careful Seeding
 JA Lloyd (1957), Least Squares Quantization in PCM
 EW Forgy (1965), Cluster analysis of multivariate data: efficiency vs interpretability of classifications
 MacQueen (1967), Some methods for classification and analysis of multivariate observations.
 Hartigan and Wong (1979), A Kmeans clustering algorithm
 Kanungo, Mount, Netanyahu, Piatko, Silverman, and Wu, An efficient kmeans clustering algorithm: Analysis and implementation
 Bottou, Bengio, Convergence Properties of the KMeans Algorithms
Variations:
 ISODATA (Iterative SelfOrganising Data Analysis Technique Algorithm) builds on kmeans and tries to be smart in initial positions and in variations of k(verify).
 Hmeans is a variation on kmeans that recalculates the centroid only after a complete iteration over all the items, not after each reassignment.
 Seen one way, it checks for error decrease less often, which makes it a smidge more sensitive to local minima and perhaps doesn't converge as nicely.
 Although the difference in practice tends to be slight, kmeans tends to be the slightly safer (and much more common) choice, even if the order it handles elements in is a different kind of bias that hmeans avoids.
Canopy clustering
https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Bisecting kmeans
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
hard cmeans
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
UPGMA, WPGMA
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
UPGMA = Unweighed Pair Group Method using Arithmetic averaging
WPGMA = Weighed Pair Group Method using Arithmetic averaging
(These specific names/abbreviations come from Sneath and Sokal 1973)
UPGMA assigns equal weight to all distances:
D((u,v),w) = (n_{u}*D(u,w) + n_{v}*D(v,w)) / (n_{u}+n_{u})
WPGMA uses:
D((u,v),w) = (n_{u}*D(u,w) + n_{v}*D(v,w)) / 2
In the unweighed variant, the two things being combined weigh equally,
in the weighed variant, all leaves that are part of a cluster weigh in as much as the other part.
Bottomup combiners working from a difference matrix, combining whatever leaf/cluster distance is minmal, then recalculating the difference matrix.
It's not too hard too argue this terminology is a little arbitrary
UPGMC, WPGMC
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Unweighed Pair Group Method using Centroids, and Weighed Pair Group Method using Centroids
(the specific names/abbreviations come from Sneath and Sokal 1973)
Fuzzy cmeans (FCM)
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
 Type: fuzzy clustering (not really partitioning anymore)
Method: Like kmeans, but weighs centroid recentering calculations by fuzzy distance to all data points
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/cmeans.html
http://www.scholarpedia.org/article/Fuzzy_CMeans_Cluster_Analysis
Fuzzy kmeans
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Shell clustering
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Basic fuzzy cmeans would include by a radius  a sphere.
There are various other options, including:
 fuzzy cquadric shells algorithm (FCQS) detects ellipsoids
 fuzzy cvarieties algorithm (FCV) detects infinite lines (linear manifolds) in 2D
 adaptive fuzzy cvarieties algorithm (AFC): detects line segments in 2D data
 fuzzy cshells algorithm (FCS) detects circles
 fuzzy cspherical shells algorithm (FCSS) detects circles
 fuzzy crings algorithm (FCR) detects circles
 fuzzy crectangular shells algorithm (FCRS) detects rectangles
 GathGeva algorithm (GG) detects ellipsoids
 GustafsonKessel algorithm (GK) detects ellipsoids of roughly the same size
PAM
Partitional, medoidbased
CLARA
Partitional, medoidbased
AGNES
Hierarchical
AGglomerative NESting
DIANA
Hierarchical
DIvisie ANAlysis
Buckshot
Hybrid
Birch
Hybrid
BITCH (balanced iterative reducing and clustering using hierarchies)
https://en.wikipedia.org/wiki/BIRCH
Cure
Hybrid
Clustering Using REpresentatives
https://en.wikipedia.org/wiki/CURE_algorithm
Rock
Hybrid
"Rock: A robust clustering algorithm for categorical attributes"
Chameleon
Hybrid, Hierarchical
"CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling"
DBSCAN
Densitybased.
The assumption that real objects will always be a dense cloud of points more easily rejects random points as noise/outliers (even if relatively close).
It can also deal decently with closeby nonlinear clusters, if separated cleanly.
http://en.wikipedia.org/wiki/DBSCAN
OPTICS
Similar to DBSCAN
https://en.wikipedia.org/wiki/OPTICS_algorithm
FLAME
Fuzzy, densitybased
http://en.wikipedia.org/wiki/FLAME_Clustering
Clustering by Committee (CBC)
Hybrid
Based on responsive elements (a comittee) voting on specific outcomes.
See also:
 http://www.aclweb.org/aclwiki/index.php?title=CBC
 PA Pantel (2003) Clustering by Committee [1]
Principal Direction Divisive Partitioning (PDDP)
Information Bottleneck
See also:
 https://en.wikipedia.org/wiki/Information_bottleneck_method
 N Slonim, "The Information Bottleneck: Theory and Applications"
Agglomerative Information Bottleneck
See also:
 N Slonim, N Tishby (1999) "Agglomerative Information Bottleneck"