{{math notes}}

===As a wide concept===

Dimensionality reduction can be seen in a very wide sense: creating a simpler variant of the data that focuses on the more interesting parts and removes the less interesting ones.


Note that in such a wide sense a lot of learning is useful as dimensionality reduction, just because the output is a useful and smaller thing, be it clustering, neural networks, whatever.

But in most cases it's also specific output, minimal output for that purpose.


Dimensionality reduction in practice is a much more mellow version of that, reducing data to a more manageable form. It historically often referred to things like factor analysis and multivariate analysis, i.e. separating out what seem to be structural patterns, but not doing much with them yet, often acknowledging that we usually still have entangled surface effects in the observations given, and that our direct output is probably still a poor view of the actual dependent variables that created the observations we look at.
(Whether that's an issue or not depends largely on what it's being used for.)


Reducing the amount of dimensions in highly dimensional data is often done to alleviate the '''curse of dimensionality'''.[https://en.wikipedia.org/wiki/Curse_of_dimensionality]

Said curse comes mainly from the fact that for every dimension you add, the implied volume increases faster and faster.

So anything that wants to be exhaustive is now in trouble:
statistics has an exponentially harder job of proving significance (at least without exponentially more data),
machine learning needs correspondingly more data to train well,
optimization needs to consider all combinations of dimensions as variables, etc.


Distance metrics also get funny, in that the influence of any one dimension becomes smaller, so differences in the metric become smaller - and more error-prone in things like clustering.
{{verify}}

It's not all bad, but certainly a thing to consider.
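
To see that distance effect concretely, you can sample random points and compare the nearest and farthest distances as the number of dimensions grows. A minimal illustrative sketch (plain numpy; the point counts and dimensions are made up):

<syntaxhighlight lang="python">
# Illustrative sketch: as dimensionality grows, pairwise Euclidean distances
# between random points concentrate, so the relative difference between the
# nearest and farthest neighbour shrinks.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))          # 500 uniform random points in d dimensions
    ref = rng.random(d)                    # a reference point
    dists = np.linalg.norm(points - ref, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
</syntaxhighlight>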

<!--
It includes organizing data into a simpler form, classifying and/or partitioning in the process, often to ends like extracting patterns, or reducing the dimensionality of the data, sometimes so much as to draw simple conclusions. Along with those go various methods of testing the apparent accuracy of the new form, and statistically proving it isn't just one bad solution among many.


Methods or results are rarely provably optimal, but you can usually estimate how much they agree with the input data.
-->
<!--
Like statistics, you reduce data into less data but make implicit decisions along the way, and like statistics, some methods are better for certain types of data and may be structurally weaker than others.


Pattern recognition may work on a settled, well-dimensioned form of data, where specific methods may be applied and known to be good. In certain other cases, the data is in a much looser form for the question to be answered. Consider, for example, an AGCT sort of string of DNA - it is not immediately clear where sequences (or even codons) start or end.
-->

===Ordination, Factor Analysis, Multivariate analysis===
{{stub}}

(See also [[Statistics#Multivariate_statistics]])


[http://en.wikipedia.org/wiki/Ordination_%28statistics%29 Ordination] widely means (re)ordering objects so that similar objects are near each other, and dissimilar objects are further away.


Ordination methods are often a step in something else, e.g. as good data massage before clustering.

It can become more relevant to higher-dimensional data (lots of variables), in that ordination is something you watch (so sort of an implicit side effect) in the process of dimensionality reduction methods.

Of course, different methods have their own goals and focus, different requirements, and their own conventions along the way.

Typical methods include PCA, SVD, and MDS (see the sections below).

<!--
Many methods resemble each other. Some are different in their fundaments,
some are largely equivalent,
some are actually much the same but have different names in different fields (and a few conditions/assumptions specific to that field),
often being defined as a lower-variate expression of much the same information (in terms of a metric, often distance / variance).

For example, in basic eigendecomposition you get same-sized output, but with the most variance explained by the first variable, the second most by the second, etc., so you can take the first few and get a decent approximation of the largest patterns in the data.

This can also prove useful in efficient-to-execute clustering, some learning algorithms, efficient approximated storage, and such.
-->
<!--
http://ordination.okstate.edu/
-->
<!--
This makes it hard to make a short summary/typology, and easier to do individual comparisons.
For example, take PCA, MDS, and SVD.


In their goal they are largely the same: they compute an orthogonal transform that de-correlates the variables, and make it easy to address/keep the most significant ones.

Numerically this is often solved by either SVD of the data matrix (after centering), or an eigendecomposition of the covariance matrix.


MDS is itself a type of analysis, a description, or rather a set of related descriptions, where 'classical MDS' is the more general one.
The solution to classical MDS is often an eigenvalue method (frequently PCA). Other MDS variants are more complex.


Classical MDS minimizes dimensions while preserving distance, and e.g. PCA minimizes dimensions while preserving covariance (it works on a covariance matrix), which is equivalent ''if'' the covariance is based on euclidean distance -
and not if it's another metric. So you could say that PCA solves, or ''is'', a specific case of MDS.


In turn, that description of PCA is still a description - that of maximizing the explained variance.
If written down more formally, it turns out this is equivalent to the eigenvalue problem
of finding eigenvalues and eigenvectors of the covariance matrix.
Which is why PCA is usually explained as being that eigendecomposition of the covariance matrix.

SVD on the data matrix itself (after [[mean centering]]) also yields the same principal directions.

PCA of matrix X is eigendecomposition of XX<sup>⊤</sup> (note: symmetric, and called a scatter matrix).
SVD of X gives X = UΣV<sup>⊤</sup>, so XX<sup>⊤</sup> = UΣ<sup>2</sup>U<sup>⊤</sup>: the singular vectors are the eigenvectors of that scatter matrix, and the singular values are the square roots of its eigenvalues.


The results of SVD and PCA are often much the same in expression, though often not numerically.


For pragmatic reasons SVD can be preferable to PCA the covariance-matrix way, in that explicitly forming the covariance matrix loses some precision for some matrices (its condition number is the square of the data matrix's).


For example, there is a handful of eigendecomposition-based, principal-factor style ordination methods (e.g. PCA and some variants, some variants of MDS): when your input is a (variance-)covariance matrix or correlation matrix, the resulting eigenvectors correspond to principal components, and the eigenvalues to the relative amount of variance explained by each principal component.


The principal components ''may'' show a fairly isolated effect that you ''may'' be able to name. (This depends on the application, and is often only true for the first few.)
But keep in mind that they are in the end mathematical constructs and should not be taken too literally.


Often enough, when you think dimensionality reduction would be a good idea for some data it is, and it often turns out that most of the variance is covered by the first dozen PCs.

The amount of eigenvectors/components to extract/use is fairly arbitrary and ad hoc.
As a rule of thumb, when adding further PCs makes little difference (e.g. to the correlation with the original data set), you have enough to work with. If removing one changes the data noticeably, you probably shouldn't have.

See also scree plots, which plot the amount of the overall variation delivered by each PC. It might be nice to also get a cumulative fraction of variance covered by the first that-many PCs (and perhaps the correlation to the original data).
-->
<!--
Some of the better-known names in ordination include:
-->

===Factor Analysis, Principal Component Analysis (PCA), and variants===
<!--
Eigenvector-style analysis with a Euclidean metric.

Functionally, it projects input data into a lower-dimensional orthogonal subspace
that captures as much of the variation of the data as possible. This is ''usually'' used for dimensionality reduction.


It originates in the [https://en.wikipedia.org/wiki/Principal_axis_theorem principal axis theorem]
and was developed by Pearson in 1901, and independently by Hotelling in the 1930s.

There is a set of closely related developments (identical in many but not necessarily all respects); these close variants are known in different fields as:
* the discrete Karhunen–Loève transform / Kosambi–Karhunen–Loève transform (KLT) (signal processing)
* the Hotelling transform (multivariate quality control)
* proper orthogonal decomposition (POD) (mechanical engineering)
* empirical orthogonal functions (EOF) (meteorology, geophysics)
* empirical modal analysis (structural dynamics)
* spectral decomposition (noise and vibration)
* Schmidt–Mirsky theorem (psychometrics)

* eigenvector decomposition (linear algebra)
* singular value decomposition (SVD) (linear algebra)

* factor analysis
* Eckart–Young theorem
* empirical eigenfunction decomposition
* empirical component analysis
* quasiharmonic modes


These are not all the same; details will vary.

-->
<!--

The field of eigenvector ''approaches'' is wider, and includes similar analyses, such as classic MDS.


PCA is technically part of a wider group of techniques, Factor Analysis[http://en.wikipedia.org/wiki/Factor_Analysis].
PCA is the best known and the most used of them.


Methods can be divided into global and local(-and-typically-iterative) solutions.

The global ones are the most basic approaches:
* eigenvalue decomposition of a data covariance (or correlation) matrix
: requires the computation of the covariance matrix, effectively making the whole thing O(np<sup>2</sup>) (p is data dimensionality, n is amount of samples)
: simply ''does not scale'' to large p, i.e. higher-dimensional data

* singular value decomposition of a data matrix
: similar issue to the above


The local ones:
* iterative:
: calculates the first component without the covariance matrix
: iteration: subtract that solution and do again for the second, etc.
: flaws: errors carry over, so only sensible for a smallish amount of components

* NIPALS (Non-linear iterative partial least squares)
: efficient at only calculating the first few principal components
: probably the most standard PCA
: Wold et al., 1987, "{{search|Principal Component Analysis wold 1987 Chemometrics and Intelligent Laboratory Systems|Principal Component Analysis}}"


* sequential methods
: updates state as new examples come in.



See also:
* https://en.wikipedia.org/wiki/Lanczos_algorithm


http://www.statsoft.com/textbook/stfacan.html


As a commonly used method, it has various variations, including:

=====Principal Coordinate Analysis (PCoA, PCoorA, PCO)=====


=====Kernel PCA=====
Basically the application of kernel methods[http://en.wikipedia.org/wiki/Kernel_methods]
before doing the PCA.

-->
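
Functionally, PCA projects the data onto the few orthogonal directions that capture the most variance. As a minimal illustrative sketch, here is PCA computed via SVD of the mean-centered data matrix in plain numpy (the function and variable names are mine, not a reference implementation):

<syntaxhighlight lang="python">
# Minimal PCA sketch: project data onto the directions of largest variance,
# computed via SVD of the mean-centered data matrix (rows = samples).
import numpy as np

def pca(X, n_components=2):
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # mean-center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # principal directions (rows)
    scores = Xc @ components.T               # data projected onto those directions
    explained_var = (s**2) / (len(X) - 1)    # eigenvalues of the covariance matrix
    return scores, components, explained_var[:n_components]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
scores, comps, var = pca(X, n_components=2)
print(scores.shape, var)
</syntaxhighlight>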

===Correspondence Analysis (CA)===
Conceptually similar to PCA, but uses a chi-square distance,
to be more applicable to [[nominal data]] (where PCA applies to [[continuous data]]).


See also:
* http://en.wikipedia.org/wiki/Correspondence_Analysis

===Multi-dimensional scaling (MDS)===

Refers to a group of similar analyses, with varying properties and methods,
but which all focus on ordination<!--:
They find an embedding in a metric space (and a suitable dimension)
based on dissimilarity(/similarity) of the items it shows.

Eigenvector-based ordination, one of the better known among its relatives.
-->

Commonly mentioned types of MDS include:
* Classical multidimensional scaling, a.k.a. Torgerson Scaling, Torgerson–Gower scaling <!--
** apparently ensures we are working on similarities, then executes PCA?{{verify}}  Minimizes a loss function.-->
* Metric multidimensional scaling
* Non-metric multidimensional scaling


It is regularly used to provide a proximity visualization, so the target dimensions may be two or three simply because this is easier to plot.


Depending on how you count, there are somewhere between three and a dozen different MDS algorithms.

Some MDS methods closely resemble things like PCA, SVD, and others in how they change the data.
Some more generic MDS variants are more on the descriptive side, so can be ''solved'' with PCA, SVD, and such.


As ordination, most try not to change the relative distances, but do change the coordinate system in the process.


=====Input=====

Input to many methods is a similarity matrix - a square symmetric matrix often based on a similarity metric.
Note some similar methods may be based on ''dis''similarity instead.


At a very pragmatic level, you may get
* a ready-made similarity matrix
* items plus a method to compare them
* a table of items versus features (such as coordinates, preferences), with a method to compare them
* perceived similarities between each item (e.g. in market research)

There is little difference, except in some assumptions like whether the feature values are Euclidean, independent, orthogonal, and whatnot.


=====Result evaluation=====
MDS solutions tend to be fairly optimal, in that ''for the amount of target dimensions'' you will get the solution's distances correlating to the original data's distances as well as they can.

There are still a number of things that help or hinder accuracy, primarily:
* the choice of input values
* the choice of target dimensions (since too few lead to ill placement choices)
* (to some degree) the type of MDS
* methods that have cluster-like implementations may be sensitive to initial state


You probably want to see how good a solution is.

The simplest method is probably calculating the [[correlation coefficient]] between input (dis)similarity and resulting data (dis)similarity, to show how much the MDS result fits the variation in the original (a small sketch of this check follows below). By rule of thumb, values below 0.6 mean a bad solution, and values above 0.8 or 0.9 are pretty good solutions (depending on the accuracy you want, but also varying with the type of MDS).

Other methods include
* Kruskal's Stress,
* split data tests,
* data stability tests (i.e., eliminating one item, see if the result is similar),
* test-retest reliability [http://en.wikipedia.org/wiki/Test-retest]
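
A minimal sketch of that correlation check in plain numpy (here diss and coords are placeholders for your own dissimilarity matrix and the coordinates an MDS routine produced; they are not from any particular package):

<syntaxhighlight lang="python">
# Sketch of the correlation check described above: compare the input
# dissimilarities with the pairwise distances in the MDS output.
import numpy as np

def mds_fit_correlation(diss, coords):
    diss = np.asarray(diss, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # pairwise Euclidean distances between output points
    out = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices_from(diss, k=1)        # upper triangle, no diagonal
    return np.corrcoef(diss[iu], out[iu])[0, 1]
</syntaxhighlight>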

=====Algorithms=====

Note that in general, correlation can be complete (if k points in k-1 dimensions, or distances between any two items are equal whichever way they are combined, e.g. if distances are those in euclidean space), but usually is not.
The output's choice of axes is generally not particularly meaningful.


The most common example is '''principal coordinates analysis (PCO, PCoA)''', also known as '''Classical multidimensional scaling''', '''Torgerson Scaling''' and '''Torgerson-Gower scaling''', which is a single calculation step so does not require iteration or convergence testing.

See also (Torgerson, 1952), (Torgerson, 1958), (Gower, 1966), [http://www.google.com/search?hl=en&q=site%3Ar-project.org%20CMDSCALE CMDSCALE in R]

{{comment|(Note: PCO (Principal Coordinate analysis) should not be confused with PCA (Principal ''Component'' Analysis) as it is not the same method, although apparently equivalent
when the PCA kernel function is isotropic, e.g. is working on Euclidean coordinates/distance)}}{{verify}}


The first dimension in the result should capture the most variability (first principal coordinate), the second the second most (second principal coordinate), etc.
The eigenvectors of the input distance matrix yield the principal coordinates, and the eigenvalues give the proportion of variance accounted for. As such, eigenvalue decomposition (or the more general singular value decomposition) can be used for this MDS method.
(The degree to which distances are violated can be estimated by how many small or negative eigenvalues there are. If there are none (...up to a given amount of dimensions...) then the analysis is probably reasonable), and you can use the eigenvalues to calculate how much of the total variance is accounted for{{verify}} - and you have the option of choosing afterwards how many dimensions you want to use (which you can't do in non-metric).
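
For illustration, a minimal numpy sketch of that calculation. (The usual formulation double-centers the squared distances before the eigendecomposition; function and variable names here are mine.)

<syntaxhighlight lang="python">
# Classical MDS (PCoA) sketch: from a distance matrix D to k-dimensional coordinates.
import numpy as np

def classical_mds(D, k=2):
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigval, eigvec = np.linalg.eigh(B)           # eigh returns ascending order
    order = np.argsort(eigval)[::-1]             # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    # negative eigenvalues indicate how non-Euclidean the input distances are
    pos = np.clip(eigval[:k], 0, None)
    coords = eigvec[:, :k] * np.sqrt(pos)
    explained = pos / eigval[eigval > 0].sum()   # fraction of variance per coordinate
    return coords, explained
</syntaxhighlight>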

'''Metric (multidimensional) scaling''': a class of MDS that assumes dissimilarities are distances (and thereby also that they are symmetric).
May be used to indicate PCO, but is often meant to indicate a class based on something of a generalization, in that the stress function is more adjustable.
The optimization method used is often [http://en.wikipedia.org/wiki/Stress_Majorization Stress Majorization] (see also SMACOF [http://en.wikipedia.org/wiki/Stress_majorization#The_SMACOF_algorithm] (Scaling by Majorizing A COmplicated Function)).


On iterative MDS methods:
* they minimize stress
* there is no unique solution (so starting position may matter)
* stress-based MDS methods
* may be little more than a non-parametric version of PCO{{verify}}


'''Non-metric multidimensional scaling''' can be a broad name, but generally finds a (non-parametric) ''monotonic'' relationship between [the dissimilarities in the item-item matrix and the Euclidean distance between items] and [the location of each item in the low-dimensional space].

The optimization method is usually something like isotonic regression (which is due to the monotonicity constraints). Methods regularly have both metric and non-metric parts, and non-metric scaling in the broad sense can describe quite varying methods (see e.g. Sammon's NLM).

Note that the monotonic detail means that the ranking of items ends up as more important than the (dis)similarities themselves. This may be a more appropriate way of dealing with certain data, such as psychometric data, e.g. ratings of different items on an arbitrary scale.

Non-metric MDS may give somewhat lower-stress solutions than metric MDS in the same amount of dimensions.{{verify}}

Also, certain implementations may deal better with non-normality or varying error distributions (often by not making those assumptions).
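
As a usage sketch, assuming scikit-learn (whose MDS estimator has a non-metric mode and accepts precomputed dissimilarities; the toy data here is made up):

<syntaxhighlight lang="python">
# Non-metric MDS sketch using scikit-learn's MDS estimator on a
# precomputed dissimilarity matrix (a square symmetric array).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
X = rng.random((30, 4))                                   # toy items with 4 features
diss = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
           n_init=8, random_state=0)                      # several restarts, since solutions are not unique
coords = nmds.fit_transform(diss)
print(coords.shape, nmds.stress_)
</syntaxhighlight>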

Examples:
* Kruskal's non-metric MDS{{verify}}
* Shepard-Kruskal Scaling{{verify}} (and {{verify}} whether that isn't the same thing as the last)
* Sammon non-linear mapping [http://www.google.com/search?hl=en&q=%22R%20Documentation%22%20%20sammon]


Variations of algorithms can be described as:
* '''Replicated MDS''': evaluates multiple matrices simultaneously
* '''Three-way scaling'''
* '''Multidimensional Unfolding'''
* '''Restricted MDS'''


Multidimensional scaling (MDS) [http://www.analytictech.com/networks/mds.htm]
* classical MDS (quite similar to PCA under some conditions, apparently when you use euclidean distance?)
** (Difference matrix -> n-dimensional coordinates (vectors))
* Kruskal's non-metric MDS (R: isoMDS)
* Sammon's non-linear mapping (R: sammon)


=====See also=====
* WS Torgerson (1958) ''Theory and Methods of Scaling''
* JB Kruskal, and M Wish (1978) ''Multidimensional Scaling''
* I Borg and P Groenen (2005) ''Modern Multidimensional Scaling: theory and applications''
* TF Cox and MAA Cox (1994) ''Multidimensional Scaling''


====Generalized MDS (GMDS)====
A generalization of metric MDS where the target domain is non-Euclidean.

See also:
* http://en.wikipedia.org/wiki/Generalized_multidimensional_scaling_(GMDS)


====Sammon’s (non-linear) mapping====
<!--

-->

===Singular value decomposition (SVD)===

<!--
One of the classical types of ordination.

A matrix factoring/orthogonalizing technique, used in statistics and signal processing/analysis.

Seen used in classical MDS, Principal Coordinate Analysis (abbreviated PCoA, PCoorA, PCO), one way of doing Principal Component Analysis (PCA), latent semantic indexing/analysis, and more.{{verify}}


Can be seen to do a few different things:
* a transformation of correlated variables to uncorrelated variables
* seeing which variables/dimensions explain the most variation in the data, and ordering them by that {{verify}}
* reducing data, using those last two facts

Since an SVD solution calculates its own basis, it means that effects rotated to play in multiple dimensions are likely to be fairly isolated in a single one in the output.



Takes an m-by-n real-valued matrix A, where m &ge; n.

Generally, this input is individual objects by features for each, which can be seen as more everyday than most other sorts of matrices - many summarizing tables take that form. Other examples include time and FFT bands, documents and terms in them, or such.



The idea is to separate a matrix A into a solution to ''A=U*S*V<sup>T</sup>'' {{comment|(also W or D instead of S or &#x3A3;, this seems purely naming convention and to have no difference in meaning)}}{{verify}}.

Accordingly, SVD produces a solution in the form of three such matrices:
* U is an m-by-n (column-orthogonal) matrix, the 'left singular vectors' (an orthogonal basis of K<sup>m</sup>)
* S is an n-by-n diagonal matrix, with only non-negative elements
* V<sup>T</sup>, the transpose of an n-by-n orthogonal matrix, contains the 'right singular vectors' (an orthogonal basis of K<sup>n</sup>)

The columns U<sub>i</sub> and V<sub>i</sub> are called the left and right singular vectors, the elements S<sub>i,i</sub> (non-negative reals) are the singular values.

Each right singular vector is mapped onto the corresponding left singular vector, scaled by the singular value: ''A v<sub>i</sub> = S<sub>ii</sub> u<sub>i</sub>''.
Roughly speaking, U is the base, S stretches/scales it, and V adjusts/aligns it.


Notes:
* The three matrices are necessary to reconstruct the original data, but note that certain uses of SVD, such as Latent Semantic Analysis, are only interested in the orthogonalization and/or reduction, so may omit some part of the calculation (apparently omitting U, yielding only S and V).

* The SVD always exists if A is real or complex (and under other conditions that can be handled inside an implementation).
The solution is singular under certain conditions (which?{{verify}}).

You can use limited data from these three matrices to approximate A with fair and fairly noise-resilient accuracy. That is, A<sup>l</sup>, being the sum (for i=1..l) of ''u<sub>i</sub> s<sub>i</sub> v<sup>T</sup><sub>i</sub>'', minimizes the sum of the squares between the elements of that summed matrix and A - i.e. that result is the closest rank-l matrix to A.


Other things you can do include:
* dimensionality reduction in a principal component analysis style

* If the input was FFT bands, you can take out the S matrix and treat it as a rough characterization of the frequency bands in A, because the S<sub>i,i</sub> values reflect how important the (in the original case of the FFT) frequency band is in reconstructing the data -- assuming the least-squares metric is a good measure. For intensity data like light and sound you may want to apply a log for some more accuracy.

* When the m{{verify}} dimension is continuous sampling, the plot of V<sup>t</sup> may reveal interesting patterns (related to that continuity)

* The effective rank (using a halfway smart threshold value on the s<sub>i</sub> values) probably roughly reflects the complexity of the music. If taking the SVD in a windowed way, the variance (in an entropy way) should reflect change over time.

* (column-orthogonal means that U<sup>t</sup>*U and V<sup>t</sup>*V evaluate to identity matrices; since V is square, V*V<sup>t</sup> does too, but U*U<sup>t</sup> does not unless m = n)
-->
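
For illustration, a minimal numpy sketch of the decomposition and of using it for a low-rank (rank-l) approximation (generic code, not tied to any particular application; the matrix here is made up):

<syntaxhighlight lang="python">
# SVD sketch: decompose A = U S V^T and build the closest rank-l approximation.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 5))                     # m-by-n, m >= n

U, s, Vt = np.linalg.svd(A, full_matrices=False)
l = 2
A_l = U[:, :l] @ np.diag(s[:l]) @ Vt[:l]        # rank-l least-squares approximation

print(s)                                        # singular values, largest first
print(np.linalg.norm(A - A_l))                  # approximation error (Frobenius norm)
</syntaxhighlight>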

See also:
* http://www.kwon3d.com/theory/jkinem/svd.html
* http://en.wikipedia.org/wiki/Singular_value_decomposition
* http://www.netlib.org/lapack/lug/node53.html
* http://public.lanl.gov/mewall/kluwer2002.html

===Manifold learning===

An approach to non-linear dimensionality reduction.

Can be thought of as an extension of PCA-style techniques that is sensitive to non-linear structure in data.


<!--
Applies to cases where the dimensionality is high somewhat artificially, due to how it's being measured.


t-SNE/TSNE
https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html


https://scikit-learn.org/stable/modules/manifold.html
-->
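
As a usage sketch, assuming scikit-learn (whose manifold module includes t-SNE among other manifold learners; the toy data here is made up):

<syntaxhighlight lang="python">
# t-SNE sketch: embed high-dimensional points into 2D for visualization.
# Note that t-SNE is mainly a visualization tool, not a general-purpose transform.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=0, size=(50, 20)),
               rng.normal(loc=5, size=(50, 20))])        # two toy clusters in 20 dimensions

emb = TSNE(n_components=2, perplexity=30, init='pca',
           random_state=0).fit_transform(X)
print(emb.shape)                                          # (100, 2)
</syntaxhighlight>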

===Expectation Maximisation (EM)===
{{stub}}

A broadly applicable idea/method that iteratively uses the Maximum Likelihood (ML) idea and its estimation (MLE).
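
A minimal illustrative sketch of that iteration for a two-component 1-D Gaussian mixture (a standard textbook instance of EM; the data and names here are made up):

<syntaxhighlight lang="python">
# Minimal EM sketch for a two-component 1-D Gaussian mixture.
# E-step: soft-assign points to components; M-step: re-estimate weights,
# means, and variances from those responsibilities.
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibilities r[i, k] of component k for point i
    dens = w * np.stack([normal_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the expected log-likelihood given those responsibilities
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk

print(w, mu, var)
</syntaxhighlight>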

<!--
TODO: See also [http://www.montefiore.ulg.ac.be/~piater/Cours/INFO0013/Notes/16/toc.xhtml], [http://www.informit.com/articles/article.asp?p=363730&rl=1], [http://www.cs.wisc.edu/~mschultz/cs540/em.html], [http://sifaka.cs.uiuc.edu/course/397cxz03f/em-note.pdf], [http://grb.mnsu.edu/grbts/doc/manual/Expectation_Maximization_EM.html], [http://research.microsoft.com/research/pubs/view.aspx?tr_id=192]

Related: RANSAC.


=====EM for missing-data problems=====
{{stub}}

When you have an imperfect model and an incomplete set of observations, and wish to augment the model based on observations (based on a means of estimation), the method iterates:
* '''E''': Estimate fictitious data based on the current model and observations
* '''M''': Adjust the model to maximize the likelihood of estimating the current data (including estimated data points)

Output can be that model and/or the data estimated with it.

It is comparable to machine learning methods like Bayesian methods in its prediction-by-observation.

Note: EM easily converges to local maxima; some random initialisation, to see whether this varies or not, can be called for.


=====EM in clustering=====
{{stub}}

When used for clustering, EM can be seen as a soft version of k-means.
(See also [[Mixture Models]])

* amount of clusters fixed beforehand
* sensitive to initialisation. See variation [http://www.eece.unm.edu/~chsmith/Papers/SLA-NP-EM.pdf].


===Unsorted===
http://ordination.okstate.edu/eigen.htm
-->
