Dimensionality reduction
This is more of an overview for my own use than material for teaching or exercises.

As a wide concept
Dimensionality reduction can be seen in a very wide sense: creating a simpler variant of the data that keeps the more interesting parts and drops the less interesting ones.
Note that in such a wide sense, a lot of learning qualifies as dimensionality reduction, simply because the output is a useful and smaller thing, be it clustering, neural networks, or whatever.
In most cases, though, that output is also specific, minimal output for its particular purpose.
Dimensionality reduction in practice is a much more modest version of that: reducing data to a more manageable form. Historically it often referred to things like factor analysis and multivariate analysis, i.e. separating out what seem to be structural patterns, but not doing much with them yet. It was often acknowledged that entangled surface effects usually remain in the given observations, and that the direct output is probably still a poor view of the actual underlying variables that produced the observations we look at.
(Whether that's an issue or not depends largely on what it's being used for)
Reducing the amount of dimensions in highly dimensional data is often done to alleviate the curse of dimensionality.[1]
Said curse comes mainly from the fact that for every dimension you add, the implied volume grows ever faster.
So anything that wants to be exhaustive is now in trouble. Statistics has an exponentially harder job of proving significance (at least without exponentially more data), machine learning needs correspondingly more data to train well, optimization needs to consider all combinations of dimensions as variables, etc.
Distance metrics behave oddly, in that the influence of any single dimension becomes smaller, so the differences between distances become smaller and more error-prone in things like clustering.
(verify)
It's not all bad, but certainly a thing to consider.
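The distance effect mentioned above is easy to demonstrate. A minimal sketch (made-up setup, uniform random points): the relative contrast between the nearest and farthest pair shrinks as dimensionality grows, which is part of why distance-based methods become less reliable in high dimensions.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
contrast = {}
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))      # 200 uniform random points in d dimensions
    dists = pdist(X)              # all pairwise Euclidean distances
    # relative spread of distances: large when distances vary a lot,
    # small when they all concentrate around the same value
    contrast[d] = (dists.max() - dists.min()) / dists.min()
# contrast[2] is large; contrast[1000] is much smaller
```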
Ordination, Factor Analysis, Multivariate analysis
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
(See also Statistics#Multivariate_statistics)
Ordination widely means (re)ordering objects so that similar objects are near each other, and dissimilar objects are further away.
Ordination methods are often a step in something else, e.g. as useful data massaging before clustering.
It becomes more relevant for higher-dimensional data (lots of variables), in that ordination is something you watch happen (a sort of implicit side effect) in the process of dimensionality reduction methods.
Of course, different methods have their own goals and focus, different requirements, and their own conventions along the way.
Typical methods include PCA, SVD, and the methods discussed below.
Factor Analysis, Principal Component Analysis (PCA), and variants
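A minimal PCA sketch (the example data is made up): center the data, take the SVD, and the right singular vectors are the principal components; the squared singular values give the variance explained per component.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]    # make two columns correlated

Xc = X - X.mean(axis=0)                  # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / (s**2).sum()          # proportion of variance per component
scores = Xc @ Vt[:2].T                   # project onto the first 2 components
```

Most libraries (e.g. scikit-learn's PCA) do essentially this internally.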
Correspondence Analysis (CA)
Conceptually similar to PCA, but uses a chi-square distance, to be more applicable to nominal data (where PCA applies to continuous data).
Multidimensionsional scaling (MDS)
Refers to a group of similar analyses with varying properties and methods, all of which focus on ordination.
Commonly mentioned types of MDS include:
 Classical multidimensional scaling, a.k.a. Torgerson Scaling, Torgerson–Gower scaling.
 Metric multidimensional scaling
 Nonmetric multidimensional scaling
It is regularly used to provide a proximity visualization, so the number of target dimensions may be two or three simply because that is easier to plot.
Depending on how you count, there are somewhere between three and a dozen different MDS algorithms.
Some MDS methods closely resemble PCA, SVD, and others in how they transform the data. Some of the more generic MDS variants are more on the descriptive side, and can be solved with PCA, SVD, and the like.
As ordination methods, most try not to change the relative distances, but do change the coordinate system in the process.
Input
Input to many methods is a similarity matrix: a square symmetric matrix, often based on a similarity metric. Note that some similar methods are based on dissimilarity instead.
At a very pragmatic level, you may get
 a ready-made similarity matrix
 items plus a method to compare them
 a table of items versus features (such as coordinates, preferences), with a method to compare them
 perceived similarities between each pair of items (e.g. in market research)
There is little difference, except in assumptions like whether the feature values are Euclidean, independent, orthogonal, and whatnot.
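For the item-by-feature case above, the usual first step is to build the square symmetric dissimilarity matrix yourself. A minimal sketch (the items and metric are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

items = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])                   # 3 items, 2 features
D = squareform(pdist(items, metric='euclidean')) # square symmetric matrix
# D has zeros on the diagonal; D[i, j] is the dissimilarity of items i and j
```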
Result evaluation
MDS solutions tend to be fairly optimal, in that for the chosen number of target dimensions, the solution's distances will correlate with the original data's distances as well as they can.
There are still a number of things that help or hinder accuracy, primarily:
 the choice of input values
 the choice of the number of target dimensions (since too few leads to poor placement choices)
 (to some degree) the type of MDS
 methods that have cluster-like implementations may be sensitive to initial state
You probably want to see how good a solution is.
The simplest method is probably calculating the correlation coefficient between input (dis)similarity and resulting data (dis)similarity, to show how much the MDS result fits the variation in the original. By rule of thumb, values below 0.6 mean a bad solution, and values above 0.8 or 0.9 are pretty good solutions (depending on the accuracy you want, but also varying with the type of MDS).
Other methods include Kruskal's stress, split-data tests, data stability tests (i.e. eliminating one item and seeing whether the result stays similar), and test-retest reliability [2]
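The correlation check just described is easy to implement. A sketch (the helper name is mine, not a library function): correlate the input dissimilarities with the pairwise distances in the embedding.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

def fit_quality(D_input, embedding):
    """Correlation between original dissimilarities (square matrix)
    and pairwise distances in the low-dimensional embedding."""
    n = D_input.shape[0]
    orig = D_input[np.triu_indices(n, k=1)]  # upper triangle, no diagonal
    new = pdist(embedding)                   # distances in the result
    return pearsonr(orig, new)[0]

# sanity check: an embedding that reproduces the distances exactly
# should score a correlation of 1.0
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = squareform(pdist(pts))
```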
Algorithms
Note that in general, the correlation can be perfect (if there are k points in k-1 dimensions, or if the distances between any two items are equal whichever way they are combined, e.g. if the distances are those in a Euclidean space), but usually it is not. The output's choice of axes is generally not particularly meaningful.
The most common example is principal coordinates analysis (PCO, PCoA), also known as classical multidimensional scaling, Torgerson scaling, and Torgerson-Gower scaling, which is a single calculation step and so does not require iteration or convergence testing.
See also (Torgerson, 1952), (Torgerson, 1958), (Gower, 1966), and cmdscale in R
(Note: PCO (Principal Coordinates analysis) should not be confused with PCA (Principal Component Analysis); it is not the same method, although apparently equivalent when the PCA kernel function is isotropic, e.g. when working on Euclidean coordinates/distances)(verify)
The first dimension in the result should capture the most variability (first principal coordinate), the second the second most (second principal coordinate), etc.
The eigenvectors of the input distance matrix yield the principal coordinates, and the eigenvalues give proportion of variance accounted for. As such, eigenvalue decomposition (or the more general singular value decomposition) can be used for this MDS method.
(The degree to which distances are violated can be estimated by how many small or negative eigenvalues there are. If there are none (...up to a given number of dimensions...) then the analysis is probably reasonable.) You can use the eigenvalues to calculate how much of the total variance is accounted for(verify), and you have the option of choosing afterwards how many dimensions to use (which you can't do in nonmetric MDS).
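The eigendecomposition route just described can be sketched directly (a minimal implementation, not a library call): double-center the squared distance matrix, eigendecompose it, and scale the top eigenvectors by the square roots of their eigenvalues.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS / PCO. D: square symmetric distance matrix.
    Returns (coordinates in k dimensions, all eigenvalues, descending)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J            # double-centered Gram matrix
    w, V = np.linalg.eigh(B)               # eigenvalues ascending
    idx = np.argsort(w)[::-1]              # sort descending
    w, V = w[idx], V[:, idx]
    # principal coordinates: eigenvectors scaled by sqrt(eigenvalue);
    # clamp tiny negative eigenvalues (distance violations) to zero
    coords = V[:, :k] * np.sqrt(np.maximum(w[:k], 0))
    return coords, w
```

For genuinely Euclidean distances this recovers the original configuration up to rotation and reflection, so the pairwise distances are reproduced exactly.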
Metric (multidimensional) scaling: a class of MDS that assumes dissimilarities are distances (and thereby also that they are symmetric).
May be used to refer to PCO, but often indicates a class of methods based on something of a generalization, in which the stress function is more adjustable.
The optimization method used is often Stress Majorization (see also SMACOF [3] (Scaling by Majorizing A COmplicated Function)).
On iterative MDS methods:
 they minimize stress
 there is no unique solution (so the starting position may matter)
 stress-based MDS methods may be little more than a nonparametric version of PCO(verify)
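A hedged sketch of the iterative, stress-minimizing route: scikit-learn's MDS implements stress majorization (SMACOF) and accepts a precomputed dissimilarity matrix (the example data is made up).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = squareform(pdist(pts))                  # square dissimilarity matrix

mds = MDS(n_components=2, metric=True, dissimilarity='precomputed',
          random_state=0)                   # fixed seed: starting position matters
emb = mds.fit_transform(D)                  # iteratively minimizes stress
# mds.stress_ holds the final (residual) stress value
```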
Nonmetric multidimensional scaling can be a broad name, but generally refers to finding a (nonparametric) monotonic relationship between [the dissimilarities in the item-item matrix and the Euclidean distance between items] and [the location of each item in the low-dimensional space].
The optimization method is usually something like isotonic regression (due to the monotonicity constraints). Methods regularly have both metric and nonmetric parts, and nonmetric scaling in the broad sense can describe quite varied methods (see e.g. Sammon's NLM).
Note that the monotonicity detail means that the ranking of items ends up being more important than the (dis)similarities themselves. This may be a more appropriate way of dealing with certain data, such as psychometric data, e.g. ratings of different items on an arbitrary scale.
Nonmetric MDS may give somewhat lower-stress solutions than metric MDS in the same number of dimensions.(verify)
Also, certain implementations may deal better with nonnormality or varying error distributions (often by not making those assumptions).
Examples:
 Kruskal's nonmetric MDS(verify)
 Shepard-Kruskal scaling(verify) (and (verify) whether that isn't the same thing as the previous)
 Sammon nonlinear mapping [4]
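A hedged sketch of nonmetric MDS in practice: scikit-learn's MDS with metric=False fits ranks rather than raw dissimilarity values (the data here is made up).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

pts = np.random.default_rng(0).random((10, 3))
D = squareform(pdist(pts))                  # precomputed dissimilarities

nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
           random_state=0)                  # nonmetric: monotonic/rank fit
emb = nmds.fit_transform(D)
```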
Variations of algorithms can be described as:
 Replicated MDS: evaluates multiple matrices simultaneously
 Three-way scaling
 Multidimensional Unfolding
 Restricted MDS
Multidimensional scaling (MDS) [5]
 classical MDS (quite similar to PCA under some conditions, apparently when you use Euclidean distance?)
 (difference matrix -> n-dimensional coordinates (vectors))
 Kruskal's nonmetric MDS (R: isoMDS)
 Sammon's nonlinear mapping (R: sammon)
See also
 WS Torgerson (1958) Theory and Methods of Scaling
 JB Kruskal, and M Wish (1978) Multidimensional Scaling
 I Borg and P Groenen (2005) Modern Multidimensional Scaling: theory and applications
 TF Cox and MAA Cox (1994) Multidimensional Scaling
Generalized MDS (GMDS)
A generalization of metric MDS in which the target domain is non-Euclidean.
See also:
Sammon’s (nonlinear) mapping
Singular value decomposition (SVD)
See also:
 http://www.kwon3d.com/theory/jkinem/svd.html
 http://en.wikipedia.org/wiki/Singular_value_decomposition
 http://www.netlib.org/lapack/lug/node53.html
 http://public.lanl.gov/mewall/kluwer2002.html
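A minimal truncated-SVD sketch with numpy (the matrix is made up): keep the top k singular values/vectors as a low-rank approximation, which is the usual way SVD is used for dimensionality reduction.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted descending

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation
# (in the least-squares sense; the Frobenius error equals the norm of
# the discarded singular values)
```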
Manifold learning
An approach to nonlinear dimensionality reduction.
Can be thought of as an extension of PCA-style techniques that is sensitive to nonlinear structure in data.
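A hedged sketch of one common manifold-learning method, Isomap, via scikit-learn (the S-curve data is a standard synthetic example, not from this text): it unrolls a nonlinear 3-D manifold into 2 dimensions.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=300, random_state=0)   # 3-D nonlinear manifold
emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
# emb is a 2-D embedding that preserves geodesic (along-manifold) distances
```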
Expectation Maximisation (EM)
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
A broadly applicable idea/method that iteratively applies the Maximum Likelihood (ML) idea and its estimation (MLE).
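As an illustration only (the model and data are made up): EM for a 1-D two-component Gaussian mixture, alternating an expectation step (responsibilities) with a maximization step (ML updates of the parameters).

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic data: two well-separated Gaussian components
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])      # deliberately poor initial guesses
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])       # mixing weights

for _ in range(50):
    # E-step: responsibility of each component for each point
    # (the 1/sqrt(2*pi) constant cancels in the normalization)
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood updates given the responsibilities
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / x.size
# mu should now be close to the true component means, -3 and 3
```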