Similarity or distance measures/metrics
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Contents
- 1 What is (and isn't) a metric?
- 2 Some assumptions and conventions
- 3 Vector/coordinate measures
- 4 (Fuzzy) string comparison
- 5 Sample sets (/paired vectors)
- 6 Distribution comparisons
- 7 Comparing rankings
- 8 Unsorted
- 8.1 Alpha skew (S_{α}) divergence
- 8.2 Chi-squared distance
- 8.3 Confusion probability (τ_{α})
- 8.4 CY dissimilarity
- 8.5 Gower dissimilarity
- 8.6 Jiang's & Conranth similarity, Jiang's & Conranth distance
- 8.7 Jensen-Shannon divergence
- 8.8 Kulczynski dissimilarity/distance
- 8.9 Lin's Similarity Measure
- 8.10 Mahalanobis distance
- 8.11 McArdle-Gower dissimilarity
- 8.12 Orloci's chord distance
- 8.13 Resnik's similarity measure
- 8.14 Rogers-Tanimoto dissimilarity
- 8.15 Russell-Rao dissimilarity
- 8.16 Sokal-Michener dissimilarity
- 8.17 Sokal-Sneath dissimilarity
- 8.18 Sørensen, Dice, Bray-Curtis, etc.
- 8.19 Yule dissimilarity
- 9 Combinations
- 10 Media
- 11 See also
What is (and isn't) a metric?
Roughly, a metric is anything that gives you some sort of usable distance between two things - vectors, sequences, or whatnot.
There are a fair number of distance/divergence measures. Some handle different types of data, some assume different things about the data (e.g. distribution, and this can make them work better if accurate), some are just better than others when you are trying to bring out a specific kind of dissimilarity.
distance measures and metrics and similarity measures and dissimilarity measures and even divergence could all mean the same thing.
In practice people may use these terms more exactly - with more specific formal properties.
Formally[1], metrics are a little better defined, giving four requirements:
- non-negativity: d(x, y) ≥ 0
- symmetry: d(x,y) = d(y,x)
- identity of indiscernibles: d(x,y) = 0 if and only if x=y
- triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)
These can be relevant in some pragmatic ways (e.g. some analyses make no sense on non-symmetric measures), and in some mathematical ways (e.g. some spaces act differently).
Also, not all real-world metrics are trying to meet all of that, sometimes by useful design.
Words like distance and metric suggest an metric in the mathematical sense
Words like divergence suggest asymmetric comparisons
Words, like measure could be anything.
In general, if you care, then check.
Some assumptions and conventions
Distance measures are often used to get data ready as input for something else, for example clustering, or something else that likes simple linear data more than the raw data.
This can also be based on other things, such as probabilities.
For example, in distributional similarity you often use the relative probabilities of a word co-occurence.
Data can regularly be seen as vectors, or as often-equivalent Euclidean coordinates, where each dimension indicates a feature.
Treatment as Euclidean coordinates may suggest people are less sure of the independence of the dimensions, although this may not be a given even if not mentioned.
Vector/coordinate measures
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
L_{k} Norms
L_{k} norms are mentioned as a (slight) mathematical generalisation (though in practice you mostly see L1 and L2).
They take take the form:
(|a|^{k} + |b|^{k} + ...)^{(1/k)}
For k=1 this is the city block distance, for k=2 this is the Euclidian distance, which are also the two that seem to matter most. See notes on L1 ad L2 below.
Euclidean / L2 distance
- Input: two n-dimensional vectors, q and r
- Distance: length of represented line segment; Sqrt( ∑_{v} (q(v)-r(v))^{2})
http://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm
City block / Manhattan distance / L1 distance / Taxicab geometry
- Input: two n-dimensional vectors, q and r
- Distance: ∑_{v}|q(v)-r(v)|
http://en.wikipedia.org/wiki/Taxicab_geometry
Canberra
Variation of the city block distance (weighing?)
http://en.wikipedia.org/wiki/Canberra_distance
Cosine
- Input: two n-dimensional vectors
- Uses fairly obvious form of the cosine rule (will do formula later)
- Effectively a simple feature correlation measure
- sensitive to vector direction
- insensitive to vector length (useful over euclidean/Lk distances when length has no important meaning to a given comparison)
- Inherently uses zero point as reference; using reference example not really meaningful
(Fuzzy) string comparison
Orthographic (look alike)
Based on specific writing systems; mostly useful for phonetic alphabets.
Orthographic methods can regularly be extended from to have phonetical considerations. For example, since Levenshtein compares characters(/tokens) at a time, that comparison can be weighed, or based on a phonetic model (see e.g. L04), or perhaps applied after metaphone was applied to the input.
Methods/algorithms include:
Edit distance (of words/strings)
Edit distance is the idea of that the amount of modifications (inserts, deletes, replacements) to get from one string to the other is itself a good indicator of its difference - and one which you can fine-tune.
Depending on context, the term 'edit distance' may refer to Levenshtein distance,
or to a group of (often character-based) algorithms that do this, including:
- Levenshtein distance [2]
- Damerau-Levenshtein distance [3]
- Hamming distance (boolean inequality, but applied to characters instead of bits. So, only for strings of the same length, and specifically the amount of characters in the same position being identical[4]
- Which is O(n), so cheap, but also not fit for many purposes
- Jaro-Winkler distance [5]
- Hirschberg's algorithm [6]
- Wagner-Fischer edit distance [7]
- Ukkonen's edit distance [8]
Sequences and n-grams
- n-gram comparison
- sub-sequences
- Ratcliff/Obershelp
- LCS: Longest Common Subsequence
- LCSR: Longest Common Subsequence Ratio, defined as len(lcs)/max(len(one_string),len(other_string))
Phonological (sound alike)
Distance correctness often depends on the language something is based on.
Pronunciation models (and distances based on such):
- Soundex (back from 1919), primarily for names: uses First few letters in roughly phonetic categories
- NYSIIS (New York State Identification and Intelligence System) (from 1970), primarily for names, performs slightly better than Soundex [9] [10]
- MRA (Match Rating Approach) (from 1977), also primarily for names, is similar to and apparently slightly better than NYSIIS.
- EDITEX: distance metric comparable to levenshtein with some metaphone-like processing (using character equivalence groups)
- J Zobel, P Dart, Phonetic string matching: Lessons from information retrieval (1996)
- Caverphone's first version (from 2002) was specifically for names, the second version (from 2004) is more generally applicable.
- ALINE distance (See Kondrak 2000, A New Algorithm For The Alignment Of Phonetic Sequences)
- CORDI (See Kondrak 2002, Determining recurrent sound correspondences by inducing translation models)
- Metaphone (from 1990) Maps whole string into consonants roughly according to a specific language
- primarily for names, though it does decently for other words as well.
- Double metaphone (2000) is an improvement that
- accounts for more language-specific irregularities
- can return two results when pronunciation can be ambiguous
- Metaphone 3 (2009) is some further tweaking
- Levenshtein on phonetic tokens, e.g. metaphone or any other that returns such a string
- Note: Tapered weighing (weighing distances in the start / first quarter/half of the strong) is sometimes effective for certain languages, e.g. those that mark inflection primarily with suffix morphemes.
Unsorted
http://www.morfoedro.it/doc.php?n=223&lang=en
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#tanimoto
Sample sets (/paired vectors)
Simple matching coefficient, distance
Given two vectors of boolean (there or not) features, and summarizing variables:
- p as the number of variables that are positive in both vectors
- q as the number of variables that are positive in the first and negative in the second
- r as the number of variables that are negative in the first and positive in the second
- s as the number of variables that are negative in both
- t as the total number of variables (which is also p+q+r+s, as those are all exclusive)
Then the simple matching coefficient is the number of agreements (on positive and on negative values)
(p+s) / t
...and the simple matching distance is the number of disagreements:
(q+r) / t
Note that for a number of applications, counting s in the results makes no sense, because it counts absence of anything as hits. See Jaccard.
Jaccard index / Jaccard similarity coefficient (sample sets)
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Intuitively, Jaccard similarity the amount of features that both vectors agree is present (/true/positive, whatever), divided by the amount of features one or the other has.
You can see this as a variation on simple matching that disregards the cases where both agree the feature is missing (both are false).
You count the disagreements as cases, and of the agreements you only count those on positive values.
Given the same definitions as in simple matching (above), you can say that s are not counted as existing cases at all, and the Jaccard similarity coefficient (also known as Jaccard index) is:
p/(p+q+r)
When you see the data more directly as paired vectors storing booleans, you can also state this as:
- the size of the boolean intersection (agreement pairs)
- ...divided by the size of the boolean union (all non-zero pairs)
You can also define a Jaccard distance as 1-jaccard_similarity, which works out as:
(q+r) / (p+q+r)
- Distance: |V_{qr}|/|V_{q}|\cup|V_{r}|
- 'Has' is defined by a non-zero probability (I imagine sometimes a low threshold can be useful)
- Input: N-dimensional vectors, seen to contain boolean values
Tanimoto coefficient
An extension of cosine similarity that, for boolean data, gives the jaccard coefficient.
See also:
Distribution comparisons
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Note that various
Mutual information
- For discrete probability distributions (verify)
- Symmetric comparison
Intuitively: to what degree two distributions estimate the same values, or the degree to which one distribution tells us something about another.
http://www.scholarpedia.org/article/Mutual_information
- Introduced in (Shannon 1948)
- regularly shown in context of error correction and such
Lautum Information
Kullback–Leibler divergence
Also known as information divergence, information gain, relative entropy, cross entropy, mean information for discrimination, information number, information divergence, and other names.
Named for the 1951 paper by Kullback and Leibler, "On information and sufficiency", which built on and abstracted earlier work by Shannon and others.
- Sees vector input as probability distributions
- Measures the inefficiency of using one distribution for another (entropy idea)
- Non-symmetric - so not technically a metric, hence 'divergence'
Jeffreys divergence
A symmetrized version of KL divergence
Jensen-Shannon divergence
Also known as information radius(verify)
- (based on Kullback–Leibler)
- (mentioned in) Linden, Piitulainen, Discovering Synonyms and Other Related Words [11]
Hellinger distance
http://en.wikipedia.org/wiki/Hellinger_distance
Wasserstein metric
See also:
Earth mover's distance
See also:
Comparing rankings
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
What is alpha-skew? something KL based?
Kendall tau rank correlation coefficient, Kendall Tau distance
Kendall's tau is also known as tau coefficient, Kendall's rank
(Correlation on ordinal data)
Kendall's tau is based on the intuition that if q predicts
The Kendall Tau distance gives a distance between two lists.
It is also known as the bubble sort distance as it yields the number of swaps that bubble sort would do.
See also:
Unsorted
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Various meaures
- Cosine correlation
- Back-off method (Katz 1987) (?)
- Distance weighted averaging
- Confusion probability (approximation of KL?)
...and more.
Bhattacharyya
Alpha skew (S_{α}) divergence
See also:
- On the Effectiveness of the Skew Divergence for Statistical Language Analysis
- http://www.dcs.shef.ac.uk/%7Emlap/papers/dv_acl03.pdf
- Weeds, The Reliability of a Similarity Measure
Chi-squared distance
Confusion probability (τ_{α})
CY dissimilarity
Gower dissimilarity
Jiang's & Conranth similarity, Jiang's & Conranth distance
See also:
- J. J. Jiang's & D. W. Conranth (1997) "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy"
- http://arxiv.org/abs/cmp-lg/9709008
Jensen-Shannon divergence
(a.k.a. Information Radius ?)
Kulczynski dissimilarity/distance
a.k.a. Kulsinski ?
Lin's Similarity Measure
See also:
- Lin, D. (1998) An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning, San Francisco, CA, pp. 296–304.
Mahalanobis distance
McArdle-Gower dissimilarity
Orloci's chord distance
Resnik's similarity measure
See also:
- P. Resnik's (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy
Rogers-Tanimoto dissimilarity
Russell-Rao dissimilarity
Sokal-Michener dissimilarity
Sokal-Sneath dissimilarity
Sørensen, Dice, Bray-Curtis, etc.
Names:
- Sørensen similarity index
- Dice's coefficient
- Czekanowski index
- Hellinger distance
- Bray-Curtis dissimilarity
...are all related, and some are identical under certain restrictions or when applied to certain types of data.
See also
- http://en.wikipedia.org/wiki/S%C3%B8rensen_similarity_index
- http://en.wikipedia.org/wiki/Dice%27s_coefficient
- http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
- http://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity
Yule dissimilarity
Combinations
Back-off (e.g. Katz') vs. averaging
Media
General comparison
- MAE - Mean Absolute Error
- MSE - Mean Squared Error
- PSE -
- PSNR - Peak Signal-to-Noise Ratio
- RMSE - Root Mean Squared Error
See also
- http://infomap.stanford.edu/book/chapters/chapter4.html
- http://149.170.199.144/multivar/dist.htm
- Cosine: http://www.mega.nu:8080/ampp/rummel/uc.htm#C5
- Lee, Measures of Distributional Similarity (1999), [12]
- Weeds, Weir, McCarthy, Characterising Measures of Lexical Distributional Similarity, [13]
- CE Shannon (1948) A Mathematical Theory of Communication. [14]
Various:
- http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
- Dagan, Lee, Pereira, Similarity-Based Methods For Word Sense Disambiguation ('97)
- Lee, Measures of Distributional Similarity ('99)
- Weeds, Weir, McCarthy, Characterising Measures of Lexical Distributional Similarity
- http://www.cl.cam.ac.uk/users/alk23/subcat/subcat.html
- Kruskal - An overview of sequence comparison: time warps, string edits, and macromolecules