Dimensionality reduction

This is more of an overview for my own use than for teaching or exercise.


As a wide concept

Dimensionality reduction can be seen in a very wide sense: creating a simpler variant of the data that focuses on the more interesting parts and removes the less interesting ones.


Note that in such a wide sense, a lot of learning is useful as dimensionality reduction, simply because its output is a useful, smaller thing - be it clustering, neural networks, or whatever.

But in most cases we also mean something that produces specific, minimal output for that purpose.


Dimensionality reduction in practice is a much more modest version of that, reducing data to a more manageable form. It historically often referred to things like factor analysis and multivariate analysis, i.e. separating out what seem to be structural patterns, but not doing much with them yet - often acknowledging that we usually still have entangled surface effects in the given observations, and that our direct output is probably still a poor view of the actual variables that created the observations we look at. (Whether that is an issue depends largely on what it is being used for.)


Reducing the amount of dimensions in highly dimensional data is often done to alleviate the curse of dimensionality.[1]

Said curse comes mainly from the fact that for every dimension you add, the implied volume grows faster and faster.

So anything that wants to be exhaustive is now in trouble: statistics has an exponentially harder job of proving significance (at least without exponentially more data), machine learning needs that much more data to train well, optimization needs to consider all combinations of dimensions as variables, and so on.


Distance metrics get funny, in that the influence of any one dimension becomes smaller, so differences in the metric become smaller - and more error-prone in things like clustering. (verify)

It's not all bad, but certainly a thing to consider.
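
As a quick illustration of the distance-metric point, here is a small sketch (assuming numpy and scipy, with made-up random data) showing how the gap between the nearest and farthest pairwise distance shrinks, relatively, as dimensions are added:

 import numpy as np
 from scipy.spatial.distance import pdist
 
 rng = np.random.default_rng(0)
 for dims in (2, 10, 100, 1000):
     points = rng.random((200, dims))   # 200 random points in a unit hypercube
     dists = pdist(points)              # all unique pairwise Euclidean distances
     spread = (dists.max() - dists.min()) / dists.min()
     print(f"{dims:5d} dims: (max - min) / min distance = {spread:.2f}")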


Ordination, Factor Analysis, Multivariate analysis

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(See also Statistics#Multivariate_statistics)


Ordination widely means (re)ordering objects so that similar objects are near each other, and dissimilar objects are further away.


Ordination methods are often a step in something else, e.g. good data massage before clustering.

It tends to become more relevant for higher-dimensional data (lots of variables), in that ordination is something you get to watch (a sort of implicit side effect) in the process of dimensionality reduction methods.

Of course, different methods have their own goals and focus, different requirements, and their own conventions along the way.

Typical methods include PCA, SVD, and the family of multidimensional scaling methods described below.




Factor Analysis, Principal Component Analysis (PCA), and variants

Correspondence Analysis (CA)

Conceptually similar to PCA, but uses a chi-square distance, to be more applicable to nominal data (where PCA applies to continuous data).
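
A minimal sketch of the idea, assuming numpy and a small made-up contingency table (this mirrors the usual chi-square scaling followed by an SVD; it is an illustration rather than a full implementation):

 import numpy as np
 
 counts = np.array([[20,  5, 10],      # hypothetical category-by-category counts
                    [ 3, 15,  8],
                    [ 9,  7, 25]], dtype=float)
 
 P = counts / counts.sum()             # correspondence matrix
 r = P.sum(axis=1)                     # row masses
 c = P.sum(axis=0)                     # column masses
 
 # standardized residuals, i.e. the chi-square-style scaling
 S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
 
 U, sing, Vt = np.linalg.svd(S, full_matrices=False)
 row_coords = (U * sing) / np.sqrt(r)[:, None]       # row principal coordinates
 col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]    # column principal coordinates
 print(row_coords[:, :2])              # first two dimensions, e.g. for plotting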


See also:

Multi-dimensional scaling (MDS)

Refers to a group of similar analyses, with varying properties and methods, but which all focus on ordination.

Commonly mentioned types of MDS include:

  • Classical multidimensional scaling, a.k.a. Torgerson Scaling, Torgerson–Gower scaling.
  • Metric multidimensional scaling
  • Non-metric multidimensional scaling


It is regularly used to provide a proximity visualization, so the target dimensions may be two or three simply because this is easier to plot.


Depending on how you count, there are somewhere between three and a dozen different MDS algorithms.

Some MDS methods closely resemble things like PCA, SVD, and others in how they change the data. Some more generic MDS variants are more on the descriptive side, so can be solved with PCA, SVD, and such.


As ordination methods, most try not to change the relative distances, but do change the coordinate system in the process.


Input

Input to many methods is a similarity matrix - a square symmetric matrix, often based on a similarity metric. Note that some similar methods are based on dissimilarity instead.


At a very pragmatic level, you may get

  • a ready-made similarity matrix
  • items plus a method to compare them
  • a table of items versus features (such as coordinates, preferences), with a method to compare them
  • perceived similarities between each pair of items (e.g. in market research)

There is little difference, except in some assumptions, like whether the feature values are Euclidean, independent, orthogonal, and whatnot.
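
In the items-versus-features case, this usually just means computing the pairwise (dis)similarities yourself first. A small sketch, assuming numpy/scipy and a made-up items-by-features table:

 import numpy as np
 from scipy.spatial.distance import pdist, squareform
 
 features = np.array([[1.0, 0.2],      # hypothetical items-by-features table
                      [0.9, 0.3],
                      [0.1, 0.8],
                      [0.2, 0.9]])
 dissim = squareform(pdist(features, metric='euclidean'))
 # dissim is now the square, symmetric, zero-diagonal matrix most MDS code expects
 print(dissim.round(2))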


Result evaluation

MDS solutions tend to be fairly optimal, in that for the chosen number of target dimensions, the solution's distances will correlate with the original data's distances about as well as they can.

There are still a number of things that help or hinder accuracy, primarily:

  • the choice of input values
  • the choice of target dimensions (since too few lead to poor placement choices)
  • (to some degree) the type of MDS
  • methods that have cluster-like implementations may be sensitive to initial state


You probably want to see how good a solution is.

The simplest method is probably calculating the correlation coefficient between the input (dis)similarities and the resulting data's (dis)similarities, to show how well the MDS result fits the variation in the original. As a rule of thumb, values below 0.6 indicate a bad solution, and values above 0.8 or 0.9 indicate a pretty good solution (depending on the accuracy you want, but also varying with the type of MDS).

Other methods include Kruskal's stress, split-data tests, data stability tests (i.e. eliminating one item and seeing whether the result is similar), and test-retest reliability. [2]
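
As a sketch of such checks (assuming numpy/scipy; the function and variable names are just placeholders), comparing the input dissimilarities against the distances in a low-dimensional result:

 import numpy as np
 from scipy.spatial.distance import pdist
 
 def evaluate_fit(input_dissim, coords):
     """input_dissim: condensed vector of the original dissimilarities (as from pdist);
        coords: the low-dimensional configuration some MDS method produced."""
     fitted = pdist(coords)                       # distances in the solution
     r = np.corrcoef(input_dissim, fitted)[0, 1]  # the correlation check described above
     # one common form of Kruskal's stress (stress-1); definitions vary slightly
     stress1 = np.sqrt(((input_dissim - fitted) ** 2).sum() / (fitted ** 2).sum())
     return r, stress1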


Algorithms

Note that in general, correlation can be complete (e.g. if there are k points in k-1 dimensions, or if the distances between any two items are equal whichever way they are combined, e.g. if distances are those in Euclidean space), but usually is not. The output's choice of axes is generally not particularly meaningful.


The most common example is principal coordinates analysis (PCO, PCoA), also known as classical multidimensional scaling, Torgerson scaling, and Torgerson-Gower scaling, which is a single calculation step and so does not require iteration or convergence testing.

See also (Torgerson, 1952), (Torgerson, 1958), (Gower, 1966), and cmdscale in R.

(Note: PCO (Principal Coordinates Analysis) should not be confused with PCA (Principal Component Analysis); it is not the same method, although apparently equivalent when the PCA kernel function is isotropic, e.g. when working on Euclidean coordinates/distances.)(verify)


The first dimension in the result should capture the most variability (first principal coordinate), the second the second most (second principal coordinate), and so on. The eigenvectors of the (double-centered) input distance matrix yield the principal coordinates, and the eigenvalues give the proportion of variance accounted for. As such, eigenvalue decomposition (or the more general singular value decomposition) can be used for this MDS method. The degree to which distances are violated can be estimated from how many small or negative eigenvalues there are; if there are none (...up to a given amount of dimensions...), the analysis is probably reasonable. You can also use the eigenvalues to calculate how much of the total variance is accounted for(verify) - and you have the option of choosing afterwards how many dimensions you want to use (which you can't do in non-metric MDS).
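
A sketch of that calculation, assuming numpy (double-center the squared distances, eigendecompose, and read variance-accounted-for off the eigenvalues):

 import numpy as np
 
 def classical_mds(D, k=2):
     """D: square symmetric distance matrix; k: number of target dimensions."""
     n = D.shape[0]
     J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
     B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
     eigvals, eigvecs = np.linalg.eigh(B)       # comes out in ascending order
     order = np.argsort(eigvals)[::-1]          # so reorder, largest first
     eigvals, eigvecs = eigvals[order], eigvecs[:, order]
     coords = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0))
     explained = eigvals[:k] / eigvals[eigvals > 0].sum()   # proportion of variance per axis
     return coords, explained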




Metric (multidimensional) scaling: a class of MDS that assumes dissimilarities are distances (and thereby also that they are symmetric). It may be used to refer to PCO, but is often meant to indicate a class based on something of a generalization, in that the stress function is more adjustable. The optimization method used is often stress majorization (see also SMACOF [3] (Scaling by MAjorizing A COmplicated Function)).
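
For example, assuming scikit-learn (whose MDS estimator uses SMACOF underneath) and a precomputed square dissimilarity matrix like the one built earlier, a metric fit might look like:

 from sklearn.manifold import MDS
 
 mds = MDS(n_components=2, metric=True, dissimilarity='precomputed', random_state=0)
 coords = mds.fit_transform(dissim)    # dissim: square symmetric dissimilarity matrix
 print(mds.stress_)                    # stress of the final configuration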




On iterative MDS methods: they minimize stress, and there is no unique solution (so the starting position may matter).

  • stress-based MDS methods
  • may be little more than a non-parametric version of PCO(verify)



Non-metric multidimensional scaling can be a broad name, but generally refers to finding a (non-parametric) monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items in the low-dimensional space, and thereby a location for each item in that space.

The optimization method is usually something like isotonic regression (due to the monotonicity constraints). Methods regularly have both metric and non-metric parts, and non-metric scaling in the broad sense can describe quite varied methods (see e.g. Sammon's NLM).

Note that the monotonicity means that the ranking of items ends up being more important than the (dis)similarities themselves. This may be a more appropriate way of dealing with certain data, such as psychometric data, e.g. ratings of different items on an arbitrary scale.

Non-metric MDS may give somewhat lower-stress solutions than metric MDS in the same amount of dimensions.(verify)

Also, certain implementations may deal better with non-normality or varying error distributions (often by not making those assumptions).


Examples:

  • Kruskal's non-metric MDS(verify)
  • Shepard-Kruskal Scaling(verify) (and (verify) whether that isn't the same thing as the last)
  • Sammon non-linear mapping [4]
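
As a sketch of running such a non-metric fit in practice, again assuming scikit-learn's MDS (metric=False switches on the monotonic, isotonic-regression-style fitting described above, so only the ranking of the dissimilarities matters):

 from sklearn.manifold import MDS
 
 nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed', random_state=0)
 coords = nmds.fit_transform(dissim)   # dissim: square symmetric dissimilarity matrix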


Variations of algorithms can be described as:

  • Replicated MDS: evaluates multiple matrices simultaneously
  • Three-way scaling
  • Multidimensional Unfolding
  • Restricted MDS




Multidimensional scaling (MDS) [5]

  • classical MDS (quite similar to PCA under some conditions, apparently when you use Euclidean distance?)
    • (difference matrix -> n-dimensional coordinates (vectors))
  • Kruskal's non-metric MDS (R: isoMDS)
  • Sammon's non-linear mapping (R: sammon)


See also
  • WS Torgerson (1958) Theory and Methods of Scaling
  • JB Kruskal, and M Wish (1978) Multidimensional Scaling
  • I Borg and P Groenen (2005) Modern Multidimensional Scaling: theory and applications
  • TF Cox and MAA Cox (1994) Multidimensional Scaling


Generalized MDS (GMDS)

A generalization of metric MDS where the target domain is non-Euclidean.

See also:


Sammon’s (non-linear) mapping

Singular value decomposition (SVD)

See also:

Manifold learning

An approach to non-linear dimensionality reduction.

Can be thought of as an extension of PCA-style techniques that is sensitive to non-linear structure in data.
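
As one concrete illustration (Isomap, picked here only as an example of the family; the text above does not single out a method), assuming scikit-learn and its S-curve toy data:

 from sklearn.datasets import make_s_curve
 from sklearn.manifold import Isomap
 
 X, _ = make_s_curve(n_samples=500, random_state=0)   # points on a curved 2D sheet embedded in 3D
 embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
 # 'embedding' unrolls the non-linear structure that a purely linear projection would flatten poorly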


Expectation Maximisation (EM)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A broadly applicable idea/method that iteratively uses the Maximum Likelihood (ML) idea and its estimation (MLE).
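
A minimal sketch of that iteration for a two-component 1D Gaussian mixture, assuming numpy/scipy (the data and starting values are made up for illustration):

 import numpy as np
 from scipy.stats import norm
 
 rng = np.random.default_rng(0)
 data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])
 
 means, stds, weights = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
 for _ in range(50):
     # E-step: responsibility of each component for each point, given current parameters
     dens = np.vstack([w * norm.pdf(data, m, s) for w, m, s in zip(weights, means, stds)])
     resp = dens / dens.sum(axis=0)
     # M-step: maximum-likelihood re-estimates of the parameters, weighted by those responsibilities
     weights = resp.mean(axis=1)
     means = (resp * data).sum(axis=1) / resp.sum(axis=1)
     stds = np.sqrt((resp * (data - means[:, None]) ** 2).sum(axis=1) / resp.sum(axis=1))
 print(means, stds, weights)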