Similarity or distance measures

This article/section is a stub — probably a pile of half-sorted notes and assertions, some of which may well be wrong. Feel free to ignore it, and to fix it.

Measures that compare vectors (as locations), sequences, and a few other types of data.

There are a fair number of distance/divergence measures. Some handle different types of data, some focus on different aspects, and some are just better than others when you are trying to bring out a specific kind of dissimilarity.



What is and isn't a metric?

Colloquially, a metric is anything that gives you some sort of distance between two things.


Formally[1] a metric is defined a little more strictly, with four requirements:

  • non-negativity: d(x,y) ≥ 0
  • symmetry: d(x,y) = d(y,x)
  • identity of indiscernibles: d(x,y) = 0 if and only if x = y
  • triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)

This is relevant in some mathematical ways (e.g. some spaces behave differently), and in some more pragmatic ways (e.g. some analyses make no sense on non-symmetric measures).


...but one of the main points here is that a number of the measures below do not try to satisfy all of the above, some by useful design.

So:

  • Words like divergence suggest asymmetric comparisons
  • Words like distance and metric suggest an actual metric (but check if you care)
  • Words like measure could mean anything (so check if you care)



Some assumptions and conventions

Distance measures are often used to prepare data as input for something else, for example clustering, or other methods that prefer simple numeric data over the raw data.

This can also be based on other things, such as probabilities. For example, in distributional similarity you often use the relative probabilities of word co-occurrence.


Data can often be seen as vectors, or as the (often equivalent) Euclidean coordinates, where each dimension indicates a feature. Treating data as Euclidean coordinates may suggest people are less sure about the independence of the dimensions, although that independence is not a given even when it goes unmentioned.



Vector/coordinate measures

This article/section is a stub — probably a pile of half-sorted notes and assertions, some of which may well be wrong. Feel free to ignore it, and to fix it.

Lk Norms

Lk norms are mentioned as a (slight) mathematical generalisation (though in practice you mostly see L1 and L2).

They take the form:

(|a|^k + |b|^k + ...)^(1/k)

For k=1 this is the city block distance, and for k=2 the Euclidean distance, which are also the two that seem to matter most. See the notes on L1 and L2 below.
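
A rough sketch of the corresponding distance (the Lk norm of the difference between two vectors), in Python; the helper name lk_distance is just for illustration:

 # Sketch: Lk (Minkowski) distance between two equal-length vectors.
 def lk_distance(q, r, k=2):
     return sum(abs(a - b) ** k for a, b in zip(q, r)) ** (1.0 / k)
 print(lk_distance([0, 0], [3, 4], k=1))  # 7.0  (city block)
 print(lk_distance([0, 0], [3, 4], k=2))  # 5.0  (Euclidean)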


Euclidean / L2 distance

  • Input: two n-dimensional vectors, q and r
  • Distance: length of the line segment between them: sqrt( ∑_v (q(v)-r(v))^2 )

http://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm


City block / Manhattan / L1 distance / Taxicab geometry

  • Input: two n-dimensional vectors, q and r
  • Distance: ∑v|q(v)-r(v)|

http://en.wikipedia.org/wiki/Taxicab_geometry


Canberra

A weighted variation of the city block distance, in which each term is scaled: ∑_v |q(v)-r(v)| / (|q(v)|+|r(v)|)

http://en.wikipedia.org/wiki/Canberra_distance


Cosine

  • Input: two n-dimensional vectors, q and r
  • Similarity: the cosine of the angle between them, (q·r) / (|q| |r|) (see the sketch below)
  • Effectively a simple feature correlation measure
  • Sensitive to vector direction
  • Insensitive to vector length (useful over Euclidean/Lk distances when length has no important meaning to a given comparison)
  • Inherently uses the zero point as reference, so using a reference example is not really meaningful
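
A rough sketch in Python (the name cosine_similarity is just for illustration):

 import math
 def cosine_similarity(q, r):
     # dot product divided by the product of the vector lengths
     dot = sum(a * b for a, b in zip(q, r))
     len_q = math.sqrt(sum(a * a for a in q))
     len_r = math.sqrt(sum(b * b for b in r))
     return dot / (len_q * len_r)
 print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 - same direction, different length
 print(cosine_similarity([1, 0], [0, 1]))        # 0.0 - orthogonal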

Sample sets (/paired vectors)

Simple matching coefficient, distance

Given two vectors of boolean (there or not) features, and summarizing variables:

  • p as the number of variables that are positive in both vectors
  • q as the number of variables that are positive in the first and negative in the second
  • r as the number of variables that are negative in the first and positive in the second
  • s as the number of variables that are negative in both
  • t as the total number of variables (which is also p+q+r+s, as those are all exclusive)

Then the simple matching coefficient is the fraction of agreements (on both positive and negative values):

(p+s) / t

...and the simple matching distance is the fraction of disagreements:

(q+r) / t
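
A rough sketch in Python, counting p, q, r, s as above (the function name is just for illustration):

 def simple_matching(x, y):
     p = sum(1 for a, b in zip(x, y) if a and b)          # positive in both
     q = sum(1 for a, b in zip(x, y) if a and not b)      # positive in first only
     r = sum(1 for a, b in zip(x, y) if not a and b)      # positive in second only
     s = sum(1 for a, b in zip(x, y) if not a and not b)  # negative in both
     t = p + q + r + s
     return (p + s) / t, (q + r) / t   # (coefficient, distance)
 print(simple_matching([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5)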


Note that for a number of applications, counting s in the result makes no sense, because it counts the absence of a feature in both as a hit. See Jaccard.


Jaccard index / Jaccard similarity coefficient (sample sets)

This article/section is a stub — probably a pile of half-sorted notes and assertions, some of which may well be wrong. Feel free to ignore it, and to fix it.

Intuitively, the Jaccard similarity is the number of features that both vectors agree are present (/true/positive, whatever), divided by the number of features that one or the other has.


You can see this as a variation on simple matching that disregards the cases where both agree the feature is missing (both are false). You count the disagreements as cases, and of the agreements you only count those on positive values.

Given the same definitions as in simple matching (above), you can say that the s cases are not counted at all, and the Jaccard similarity coefficient (also known as the Jaccard index) is:

p/(p+q+r)


When you see the data more directly as paired vectors storing booleans, you can also state this as:

  • the size of the boolean intersection (agreement pairs)
  • ...divided by the size of the boolean union (all non-zero pairs)


You can also define a Jaccard distance as 1-jaccard_similarity, which works out as:

(q+r) / (p+q+r)


  • Input: n-dimensional vectors, seen as containing boolean values
  • Similarity: |V_q ∩ V_r| / |V_q ∪ V_r|, i.e. the size of the intersection divided by the size of the union
  • 'Has' is defined by a non-zero probability (I imagine sometimes a low threshold can be useful)
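
A rough sketch in Python, treating each vector as the set of positions that are 'on' (the function name is just for illustration):

 def jaccard(x, y):
     on_x = {i for i, v in enumerate(x) if v}
     on_y = {i for i, v in enumerate(y) if v}
     union = on_x | on_y
     if not union:
         return 1.0, 0.0   # convention: two all-zero vectors count as identical
     similarity = len(on_x & on_y) / len(union)
     return similarity, 1.0 - similarity
 print(jaccard([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.333..., 0.666...)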

Tanimoto coefficient

An extension of cosine similarity that, for boolean data, gives the Jaccard coefficient.
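
A rough sketch, assuming the usual definition q·r / (q·q + r·r − q·r), which on 0/1 vectors reduces to the Jaccard coefficient:

 def tanimoto(q, r):
     dot_qr = sum(a * b for a, b in zip(q, r))
     dot_qq = sum(a * a for a in q)
     dot_rr = sum(b * b for b in r)
     return dot_qr / (dot_qq + dot_rr - dot_qr)
 print(tanimoto([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.333..., same as the Jaccard example above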

See also:

Distribution comparisons

This article/section is a stub — probably a pile of half-sorted notes and assertions, some of which may well be wrong. Feel free to ignore it, and to fix it.

Note that various


Mutual information

  • For discrete probability distributions (verify)
  • Symmetric comparison

Intuitively: the degree to which two distributions estimate the same values, or the degree to which one distribution tells us something about another.

http://www.scholarpedia.org/article/Mutual_information


  • Introduced in (Shannon 1948)
  • regularly shown in context of error correction and such
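
A rough sketch for a discrete joint distribution given as a dict {(x, y): p(x, y)} (the function name and input format are just for illustration):

 import math
 def mutual_information(joint):
     px, py = {}, {}
     for (x, y), p in joint.items():
         px[x] = px.get(x, 0.0) + p
         py[y] = py.get(y, 0.0) + p
     return sum(p * math.log2(p / (px[x] * py[y]))
                for (x, y), p in joint.items() if p > 0)
 # two perfectly dependent binary variables share 1 bit:
 print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # 1.0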

Lautum Information

Kullback–Leibler divergence

Also known as information divergence, information gain, relative entropy, cross entropy, mean information for discrimination, information number, and other names.

Named for the 1951 paper by Kullback and Leibler, "On information and sufficiency", which built on and abstracted earlier work by Shannon and others.


  • Sees vector input as probability distributions
  • Measures the inefficiency of assuming one distribution when the data actually follows another (an entropy/coding idea)
  • Non-symmetric - so not technically a metric, hence 'divergence'
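
A rough sketch for two discrete distributions given as equal-length lists of probabilities (log base 2, so the result is in bits):

 import math
 def kl_divergence(p, q):
     return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
 print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.74
 print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.53 - not symmetric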


Jeffreys divergence

A symmetrized version of KL divergence

Jensen-Shannon divergence

Also known as information radius (verify)


  • (based on Kullback–Leibler)
  • (mentioned in) Linden, Piitulainen, Discovering Synonyms and Other Related Words [2]
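
A rough sketch, reusing the kl_divergence sketch from above, with M as the average of the two distributions:

 def js_divergence(p, q):
     m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
     return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
 print(js_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.15
 print(js_divergence([0.9, 0.1], [0.5, 0.5]))  # same value - symmetric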


Hellinger distance

http://en.wikipedia.org/wiki/Hellinger_distance
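
A rough sketch for discrete distributions, using the common 1/sqrt(2) normalization so the result lies in [0, 1]:

 import math
 def hellinger(p, q):
     s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
     return math.sqrt(s / 2)
 print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0
 print(hellinger([1.0, 0.0], [0.0, 1.0]))  # 1.0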


Wasserstein metric

See also:

Earth mover's distance

See also:

Comparing rankings

This article/section is a stub — probably a pile of half-sorted notes and assertions, some of which may well be wrong. Feel free to ignore it, and to fix it.


What is alpha-skew? something KL based?


Kendall tau rank correlation coefficient, Kendall Tau distance

Kendall's tau is also known as the tau coefficient, or Kendall's rank correlation coefficient.

(Correlation on ordinal data)

Kendall's tau is based on the intuition that if q predicts r well, then most pairs of items are ordered the same way in both rankings (concordant pairs), and few are ordered oppositely (discordant pairs).


The Kendall tau distance gives a distance between two lists. It is also known as the bubble sort distance, as it is the number of adjacent swaps that bubble sort would need to turn one list into the other.
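
A rough sketch that counts discordant pairs directly, given two dicts mapping item to rank position (the function name and input format are just for illustration):

 from itertools import combinations
 def kendall_tau_distance(rank_a, rank_b):
     # count pairs of items that the two rankings order differently
     return sum(1 for x, y in combinations(rank_a, 2)
                if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0)
 a = {'x': 1, 'y': 2, 'z': 3}
 b = {'x': 1, 'y': 3, 'z': 2}   # y and z swapped
 print(kendall_tau_distance(a, b))  # 1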


See also:

Unsorted

This article/section is a stub — probably a pile of half-sorted notes and assertions, some of which may well be wrong. Feel free to ignore it, and to fix it.

Various measures

  • Cosine correlation
  • Back-off method (Katz 1987) (?)
  • Distance weighted averaging
  • Confusion probability (approximation of KL?)

...and more.


Bhattacharyya




Sørensen, Dice, Bray-Curtis, etc.

Names:

  • Sørensen similarity index
  • Dice's coefficient
  • Czekanowski index
  • Hellinger distance
  • Bray-Curtis dissimilarity

...are all related, and some are identical under certain restrictions or when applied to certain types of data.
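
As one concrete point of reference, a rough sketch of Dice's coefficient on boolean data, 2|A∩B| / (|A|+|B|), which relates to the Jaccard coefficient above as 2J/(1+J):

 def dice(x, y):
     on_x = {i for i, v in enumerate(x) if v}
     on_y = {i for i, v in enumerate(y) if v}
     return 2 * len(on_x & on_y) / (len(on_x) + len(on_y))
 print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5  (Jaccard was 1/3, and 2*(1/3)/(1+1/3) = 0.5)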

See also


Alpha skew (Sα) divergence

See also:


Confusion probability (τα)

CY dissimilarity

Kulczynski distance

Chi-squared distance

Orloci's chord distance

Lin's Similarity Measure

See also:

  • Lin, D. (1998) An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning, San Francisco, CA, pp. 296–304.


Jiang & Conrath similarity, Jiang & Conrath distance

See also:


Resnik's similarity measure

See also:

  • P. Resnik (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy


Gower dissimilarity

Kulczynski dissimilarity

McArdle-Gower dissimilarity

Combinations

Back-off (e.g. Katz') vs. averaging

Media

General comparison

  • MAE - Mean Absolute Error
  • MSE - Mean Squared Error
  • PSE -
  • PSNR - Peak Signal-to-Noise Ratio
  • RMSE - Root Mean Squared Error
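
Rough sketches of the above for two equal-length signals (for PSNR, max_value is the largest possible signal value, e.g. 255 for 8-bit images):

 import math
 def mae(a, b):
     return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
 def mse(a, b):
     return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
 def rmse(a, b):
     return math.sqrt(mse(a, b))
 def psnr(a, b, max_value=255):
     return 10 * math.log10(max_value ** 2 / mse(a, b))
 a = [10, 20, 30]
 b = [12, 18, 33]
 print(mae(a, b), rmse(a, b), psnr(a, b))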



See also

  • Lee, Measures of Distributional Similarity (1999), [3]
  • Weeds, Weir, McCarthy, Characterising Measures of Lexical Distributional Similarity, [4]


  • CE Shannon (1948) A Mathematical Theory of Communication. [5]



Various:

  • Kruskal - An overview of sequence comparison: time warps, string edits, and macromolecules