Similarity or distance measures/metrics


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


What is (and isn't) a metric?

Informally, a metric is anything that gives you some sort of usable distance between two things - in practice often vectors, or sequences (including text), or maybe distributions.


In that broad view, there are many metrics.

Some handle different types of data,
some assume different things about the data (e.g. its distribution, or how related different properties are or are not),
some are just better at highlighting a specific kind of dissimilarity,
and some have mathematical or practical properties that are necessary or useful for specific problems (e.g. machine learning on high-dimensional data).


Distance measure, metric, similarity measure, dissimilarity measure, and even divergence could all mean much the same thing.


Can we get more specific?

In practice people may use these terms more specifically - with more specific formal properties. But not everyone agrees.

Not all real-world metrics are trying to meet the same mathematical requirements.

In general, if you care, don't trust the word, and read up on the details.


That said:

Words like distance and metric suggest one of the more mathematically strict definitions.
Words like divergence suggest asymmetric comparisons.
Words like measure could be anything.


For example, in mathematics, metrics are a little better defined [1], with four requirements:

  • non-negativity: d(x, y) ≥ 0
  • identity of indiscernibles: d(x,y) = 0 if and only if x=y
  • symmetry: d(x,y) = d(y,x)
  • triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)

...though apparently norm is a little less restrictive



Roughly:

  • distance can't be negative
negative distance often has no sensible meaning, and may be disruptive to the math as well
  • things with distance zero are, for these purposes, considered the same thing
  • the result has to be the same both ways
there are cases where a difference makes sense, e.g. traffic routing when one-way streets are involved,
and many where it doesn't, e.g. as-the-crow-flies spatial distance
  • the triangle inequality itself says that the sum of any two sides of a triangle is at least as large as the third. This is probably more intuitive when drawn[2]
when applied to metrics, it's mostly a 'does your metric not break basic arithmetic?' check
there are things that break this but may still be useful, e.g. Lk for k<1 applied to high-dimensional data (verify)


Such requirements can be relevant

in pragmatic ways (e.g. some analyses break, or make no sense, on non-symmetric measures, or with negative values),
and in some mathematical ways (e.g. describing how some spaces act differently).

Some assumptions and conventions

Distance measures are often used to shape data for something else.

Often that is a form of dimensionality reduction where such reduction is relatively easy, e.g. for methods that prefer simple, linear(ish), one-dimensional data over the raw data, such as most data clustering.


A lot of data is close enough to numbers or vectors already.

Or as (often-equivalent) Euclidean coordinates, where each dimension indicates a feature.


...though note that the ability to shove something into a vector doesn't imply all metrics on it are accurate - or useful at all.

For example, it is tempting to treat such vectors as Euclidean coordinates - an assumption you implicitly make as soon as you run basic Euclidean distance on them - while city block distance may be more valid because it does not implicitly assume that every feature is entangled with every other one.

But maybe that assumption has some validity; still, since you implicitly say feature = linear orthogonal dimension, this quickly runs into the questions of linear algebra.

Chances are that either will give halfway decent results very easily -- but often no better than halfway, and with distortions that are hard to understand or resolve.




In some cases the things you compare are probability distributions, or sometimes an unordered set of probabilities. For example, in distributional similarity you often use the relative probabilities of word co-occurrence.

Similarity or distance measures/metrics

Vector/coordinate measures

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Lk Norms

Lk norms are mentioned as a slight mathematical generalisation, taking the form (|a|^k + |b|^k + ...)^(1/k)

In practice this is an overkill generalization, since you mostly just see:

for k=1 this is the city block distance (a.k.a. L1)
for k=2 this is the Euclidean distance (a.k.a. L2)

See also Minkowski distance
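
As a quick sketch (plain Python; the function and example values are just for illustration), the general Lk form and its two common special cases:

 def minkowski(q, r, k):
     # general Lk distance: (sum over v of |q(v) - r(v)|^k) ^ (1/k)
     assert len(q) == len(r)
     return sum(abs(a - b) ** k for a, b in zip(q, r)) ** (1.0 / k)
 
 q = [1.0, 2.0, 3.0]
 r = [2.0, 4.0, 6.0]
 
 print(minkowski(q, r, 1))   # city block / L1:  1 + 2 + 3 = 6.0
 print(minkowski(q, r, 2))   # Euclidean / L2:   sqrt(1 + 4 + 9) ≈ 3.74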


City block / Manhattan distance / L1 distance / Taxicab geometry

  • Input: two n-dimensional vectors (in the coordinate sense), q and r
  • Distance: ∑_v |q(v) - r(v)|

i.e. the difference in each dimension independently, summed up.


The intuition that 'city block' suggests (in 2D) is a city laid out in an entirely regular grid (think US cities), where you can only travel along streets, so travel in one direction doesn't affect the distance in the other direction.

That analogy works best for integer coordinates (e.g. all coordinates are street corners), but the metric is just as valid for real-valued coordinates.


http://en.wikipedia.org/wiki/Taxicab_geometry

Euclidean / L2 distance

  • Input: two n-dimensional vectors (in the coordinate sense), q and r
  • Distance: length of the represented line segment; sqrt( ∑_v (q(v) - r(v))^2 )


From the intuition of city block distance, you're now taking the shortest diagonal.


http://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm

Canberra

  • Input: two n-dimensional vectors, q and r
  • Distance: ∑_v ( |q(v) - r(v)| / (|q(v)| + |r(v)|) )


Variation of the L1 (city block) distance that weighs each dimension's difference, roughly by its proximity to the origin.

...which seems like a weird thing to do unless you have a specific reason why that helps what you're doing, and the origin has a set meaning in your data.

http://en.wikipedia.org/wiki/Canberra_distance
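
A minimal sketch (plain Python; names made up for the example), using the common convention that 0/0 terms contribute zero:

 def canberra(q, r):
     # sum over dimensions of |q(v) - r(v)| / (|q(v)| + |r(v)|),
     # with 0/0 terms conventionally counted as zero
     total = 0.0
     for a, b in zip(q, r):
         denom = abs(a) + abs(b)
         if denom != 0:
             total += abs(a - b) / denom
     return total
 
 print(canberra([0, 1, 2], [0, 2, 4]))   # 0 + 1/3 + 1/3 ≈ 0.67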

Cosine

  • Input: two n-dimensional vectors (in the magnitude-and-direction sense)
  • Uses the fairly obvious form of the cosine rule: cos(θ) = (q · r) / (‖q‖ ‖r‖)
  • sensitive to vector direction
  • insensitive to vector length
  • Implicitly uses the zero point / origin as reference; using another reference point is not really meaningful


As cosine determines the angle between two vectors, it puts its output on a fixed scale regardless of vector magnitude.

Useful over Euclidean/Lk distances when length has no important meaning for a given comparison.

It also puts relatively little weight on each individual dimension, which can be what you want, particularly for high-dimensional data.
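
A minimal sketch (plain Python, no dependencies; note it is undefined for zero-length vectors):

 import math
 
 def cosine_similarity(q, r):
     # cos(theta) = (q . r) / (||q|| * ||r||)
     dot = sum(a * b for a, b in zip(q, r))
     norm_q = math.sqrt(sum(a * a for a in q))
     norm_r = math.sqrt(sum(b * b for b in r))
     return dot / (norm_q * norm_r)
 
 print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # 1.0 - same direction, different length
 print(cosine_similarity([1, 0], [0, 1]))         # 0.0 - orthogonal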

On high-dimensional data

Sample sets (/paired vectors)

Simple matching coefficient, distance

Given two vectors of boolean (there or not) features, and summarizing variables:

  • p as the number of variables that are positive in both vectors
  • q as the number of variables that are positive in the first and negative in the second
  • r as the number of variables that are negative in the first and positive in the second
  • s as the number of variables that are negative in both
  • t as the total number of variables (which is also p+q+r+s, as those are all exclusive)

Then the simple matching coefficient is the fraction of variables that agree (on positive and on negative values):

(p+s) / t

...and the simple matching distance is the fraction that disagree:

(q+r) / t


Note that for a number of applications, counting s in the results makes no sense, because it counts absence of anything as hits. See Jaccard.
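
A minimal sketch on boolean vectors (plain Python; names invented for this example):

 def simple_matching(x, y):
     # x, y: equal-length sequences of booleans (feature present or not)
     p = sum(1 for a, b in zip(x, y) if a and b)          # positive in both
     q = sum(1 for a, b in zip(x, y) if a and not b)      # positive in first only
     r = sum(1 for a, b in zip(x, y) if not a and b)      # positive in second only
     s = sum(1 for a, b in zip(x, y) if not a and not b)  # negative in both
     t = p + q + r + s
     coefficient = (p + s) / t   # fraction of agreements
     distance    = (q + r) / t   # fraction of disagreements
     return coefficient, distance
 
 print(simple_matching([True, True, False, False],
                       [True, False, False, True]))   # (0.5, 0.5)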


Jaccard index / Jaccard similarity coefficient (sample sets)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Intuitively, Jaccard similarity measures the number of features that both vectors agree are present (/true/positive, whatever), divided by the number of features that at least one of them has.


You can see this as a variation on simple matching where the negative-negative cases (s) are not counted at all: of the agreements, only those on positive values count, while the disagreements still count as cases.

Given the same definitions as in simple matching (above), the Jaccard similarity coefficient (also known as the Jaccard index) is:

p/(p+q+r)


When you see the data more directly as paired vectors storing booleans, you can also state this as:

  • the size of the boolean intersection (agreement pairs)
  • ...divided by the size of the boolean union (all non-zero pairs)


You can also define a Jaccard distance as 1-jaccard_similarity, which works out as:

(q+r) / (p+q+r)


  • Similarity: |V_q ∩ V_r| / |V_q ∪ V_r|
  • 'Has' is defined by a non-zero probability (I imagine sometimes a low threshold can be useful)
  • Input: N-dimensional vectors, seen to contain boolean values
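
A minimal sketch of both views - boolean paired vectors, and sets of present features (plain Python; names invented for the example):

 def jaccard_similarity(x, y):
     # x, y: equal-length sequences of booleans
     p = sum(1 for a, b in zip(x, y) if a and b)      # positive in both
     q = sum(1 for a, b in zip(x, y) if a and not b)  # positive in first only
     r = sum(1 for a, b in zip(x, y) if not a and b)  # positive in second only
     return p / (p + q + r)
 
 def jaccard_sets(a, b):
     # equivalent view: sets of present features
     return len(a & b) / len(a | b)
 
 print(jaccard_similarity([True, True, False, False],
                          [True, False, False, True]))   # 1/3
 print(jaccard_sets({'x', 'y'}, {'x', 'z'}))              # 1/3
 print(1 - jaccard_sets({'x', 'y'}, {'x', 'z'}))          # Jaccard distance, 2/3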

Tanimoto coefficient

An extension of cosine similarity that, for boolean data, gives the Jaccard coefficient.

See also:


Distribution comparisons

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Mutual information

  • Input: two discrete probability distributions / random variables (verify)
  • Symmetric comparison

Intuitively: the degree to which two variables carry the same information - or the degree to which one distribution tells us something about the other.

http://www.scholarpedia.org/article/Mutual_information


  • Introduced in (Shannon 1948)
  • regularly shown in context of error correction and such
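
As a sketch of the discrete case (plain Python; the tiny joint distributions are made up for the example):

 import math
 
 def mutual_information(joint):
     # joint: dict mapping (x, y) -> probability, summing to 1
     px, py = {}, {}
     for (x, y), p in joint.items():
         px[x] = px.get(x, 0.0) + p
         py[y] = py.get(y, 0.0) + p
     # I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x)p(y)) )
     return sum(p * math.log2(p / (px[x] * py[y]))
                for (x, y), p in joint.items() if p > 0)
 
 # perfectly dependent variables: MI equals the entropy, here 1 bit
 print(mutual_information({('a', '1'): 0.5, ('b', '2'): 0.5}))    # 1.0
 # independent variables: MI is 0
 print(mutual_information({('a', '1'): 0.25, ('a', '2'): 0.25,
                           ('b', '1'): 0.25, ('b', '2'): 0.25}))   # 0.0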


Lautum Information

Variation on mutual information that reverses the arguments of the KL divergence involved - hence the reversed name ('lautum' is 'mutual' backwards).

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.6549

Kullback–Leibler divergence

Also known as information divergence, information gain, relative entropy, (somewhat loosely) cross entropy, mean information for discrimination, information number, and other names.

Named for the 1951 paper by Kullback and Leibler, "On information and sufficiency", which built on and abstracted earlier work by Shannon and others.


  • Sees vector input as probability distributions
  • Measures the inefficiency of using one distribution in place of the other (an entropy/coding idea)
  • Non-symmetric - so not technically a metric, hence 'divergence'
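
A minimal sketch for discrete distributions given as lists of probabilities (plain Python, log base 2; the numbers are made up, and note the asymmetry):

 import math
 
 def kl_divergence(p, q):
     # D(p || q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0
     return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
 
 p = [0.5, 0.5]
 q = [0.9, 0.1]
 print(kl_divergence(p, q))   # ≈ 0.74 bits
 print(kl_divergence(q, p))   # ≈ 0.53 bits -- not the same, hence 'divergence'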

Jensen-Shannon divergence

Also known as information radius (verify)


  • (based on Kullback–Leibler)
  • (mentioned in) Linden, Piitulainen, Discovering Synonyms and Other Related Words [3]
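
Building on the KL sketch above, a minimal version (a 50/50 mixture of the two; with base-2 logs the result stays in [0, 1]):

 import math
 
 def kl_divergence(p, q):
     return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
 
 def js_divergence(p, q):
     # symmetric: average KL divergence of p and q to their midpoint mixture m
     m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
     return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
 
 print(js_divergence([0.5, 0.5], [0.9, 0.1]))   # ≈ 0.15
 print(js_divergence([1.0, 0.0], [0.0, 1.0]))   # 1.0, the maximum for base-2 logs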

Jeffreys divergence

A symmetrized version of KL divergence


Hellinger distance

http://en.wikipedia.org/wiki/Hellinger_distance


Wasserstein metric, Earth mover's distance, Kantorovich-Rubinstein metric

The earth mover name comes from the informal explanation: when the two distributions are interpreted as two different ways of piling up the same amount of dirt, this is the minimum cost of turning one pile into the other (where cost is amount moved times distance moved).

That interpretation is only valid for the same total amount of volume, so this may require some normalization - though it always applies to probability density functions, normalized histograms, and similar things.
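
For one-dimensional data this is cheap to compute. A sketch assuming SciPy is installed (scipy.stats.wasserstein_distance computes the 1D case):

 from scipy.stats import wasserstein_distance
 
 # two small samples, interpreted as empirical distributions
 print(wasserstein_distance([0, 1, 3], [5, 6, 8]))   # 5.0 -- everything shifted by 5
 
 # weighted version: value 0 with weight 3 and value 1 with weight 1,
 # against a 50/50 split over the same values
 print(wasserstein_distance([0, 1], [0, 1], [3, 1], [1, 1]))   # 0.25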


See also:

Lévy–Prokhorov metric

Rankings

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


(What is alpha-skew? something KL based?)


Kendall tau rank correlation coefficient, Kendall Tau distance

Kendall's tau is also known as the tau coefficient, or Kendall's rank correlation coefficient.

(Correlation on ordinal data)

Kendall's tau is based on the intuition that if one ranking predicts the other well, most pairs of items will be put in the same relative order by both (concordant pairs) rather than in opposite orders (discordant pairs).


The Kendall Tau distance gives a distance between two lists. It is also known as the bubble sort distance as it yields the number of swaps that bubble sort would do.
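
A minimal sketch of the naive O(n²) pair count (plain Python; for the correlation coefficient itself, scipy.stats.kendalltau exists):

 from itertools import combinations
 
 def kendall_tau_distance(a, b):
     # a and b are two rankings of the same items, as equal-length sequences of ranks;
     # counts the pairs that the two rankings put in opposite order
     assert len(a) == len(b)
     discordant = 0
     for i, j in combinations(range(len(a)), 2):
         if (a[i] - a[j]) * (b[i] - b[j]) < 0:
             discordant += 1
     return discordant
 
 print(kendall_tau_distance([1, 2, 3, 4], [1, 2, 3, 4]))   # 0, identical order
 print(kendall_tau_distance([1, 2, 3, 4], [4, 3, 2, 1]))   # 6, fully reversed (n*(n-1)/2)
 print(kendall_tau_distance([1, 2, 3, 4], [1, 3, 2, 4]))   # 1, one adjacent swap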


See also:

(Fuzzy) string comparison

Orthographic (look alike)

Edit distance (of words/strings)

Edit distance is the idea that the number of modifications (inserts, deletes, replacements) needed to get from one string to the other is a good indicator of the difference between the words.

💤 The concept of distances partly overlaps with sequence alignment problems, in particular when distances are calculated via edit distance, because that's pretty similar to the 'maximal amount of matches' requirement that sequence alignment often focuses on.

You could say that sequence alignment focuses more on the actual contents of the (minimal) set of changes (as when doing diffs), where edit distance goes no further than a single figure for how dissimilar those edits make the strings.

Some of the below that are not Levenshtein are also used in e.g. bioinformatics.


Edit distance is a more general concept; Levenshtein distance is a fairly popular implementation of it, but there are more, including (a sketch of plain Levenshtein follows this list):

  • Hamming distance (boolean inequality, but applied to characters instead of bits; so only for strings of the same length, and specifically the number of positions at which the characters differ)[6]
which is cheaper (O(n)) but also not fit for many purposes
  • Jaro-Winkler distance [7]
  • Hirschberg's algorithm [8]
  • Wagner-Fischer edit distance [9]
  • Ukkonen's edit distance [10]
Sequences and n-grams
  • sub-sequences
    • Ratcliff/Obershelp
    • LCS: Longest Common Subsequence
      • LCSR: Longest Common Subsequence Ratio, defined as len(lcs)/max(len(one_string),len(other_string))
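
As referenced above, a minimal sketch of plain Levenshtein distance, in the classic dynamic-programming form (plain Python):

 def levenshtein(s, t):
     # dynamic programming over a rolling row: prev[j] is the edit distance
     # between the first i-1 characters of s and the first j characters of t
     prev = list(range(len(t) + 1))
     for i, cs in enumerate(s, start=1):
         cur = [i] + [0] * len(t)
         for j, ct in enumerate(t, start=1):
             cur[j] = min(prev[j] + 1,               # deletion
                          cur[j - 1] + 1,            # insertion
                          prev[j - 1] + (cs != ct))  # substitution (free if equal)
         prev = cur
     return prev[len(t)]
 
 print(levenshtein("kitten", "sitting"))   # 3
 print(levenshtein("flaw", "lawn"))        # 2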


You can also use n-grams.

N-grams themselves are unordered features, which contain ordered characters.

  • DICE
amount of shared bigrams
2·|X∩Y| / (|X|+|Y|) (see the sketch after this list)
  • XDICE (extended), XXDICE (adds positions)
C Brew, D McKelvie (1996) "Word-pair extraction for lexicography"
  • Jaccard
|X∩Y| / |X∪Y|
  • TRIGRAM-2B
seems to be the name for the best-performing variant in...
B Lambert et al. (1999) "Similarity as a risk factor in drug-name confusion errors: the look-alike (orthographic) and sound-alike (phonetic) model"
and is like DICE but with n=3 and adding two spaces to the start (verify)
  • BI-SIM (verify) and a mentioned TRI-SIM
G Kondrak, B Dorr (2005) "Automatic identification of confusable drug names"
  • WDICE (weighted)
  • WXDICE
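
A minimal sketch of the set-of-bigrams idea, with DICE and Jaccard on character bigrams (plain Python; the variants above differ mainly in n, padding, weighting, and positional information):

 def ngrams(s, n=2):
     # set of character n-grams (bigrams by default)
     return {s[i:i + n] for i in range(len(s) - n + 1)}
 
 def dice(a, b, n=2):
     x, y = ngrams(a, n), ngrams(b, n)
     return 2 * len(x & y) / (len(x) + len(y))
 
 def jaccard(a, b, n=2):
     x, y = ngrams(a, n), ngrams(b, n)
     return len(x & y) / len(x | y)
 
 print(dice("night", "nacht"))      # {ni,ig,gh,ht} vs {na,ac,ch,ht}: 2*1/8 = 0.25
 print(jaccard("night", "nacht"))   # 1/7 ≈ 0.14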




Unsorted


C Brew, D McKelvie (1996), "Word-pair extraction for lexicography"


  • ALINE (phonetic sequences)
G Kondrak (2000) "A new algorithm for the alignment of phonetic sequences"
S Downey et al. (2008) "Computational feature-sensitive reconstruction of language relationships: Developing the ALINE distance for comparative historical linguistic reconstruction"

https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=3388&context=etd


Phonological (sound alike)

Levenshtein on orthographic input may be better than nothing, but does not end up reflecting pronunciation well.

Distances based on sounds often extend orthographic systems with some pronunciation models.


Things like L04 may be a more advanced Levenshtein-style comparison, in that it can weight different replacements via a model - but it assumes the input is already phonemes, and that is half the problem.


So you first look at mapping text to phonemes.

Consider:

  • Soundex (back from 1919), primarily for names: keeps the first letter and puts the following letters into rough phonetic categories


primarily for names
performs slightly better than Soundex [11] [12]


primarily for names
similar to and apparently slightly better than NYSIIS.


  • You can apply metaphone or such to the input first
any text input, in theory
Metaphone (from 1990) maps the whole string into consonants, roughly according to a specific language
primarily for names, though it does decently for other words as well.
Double metaphone (2000) is an improvement that
accounts for more language-specific irregularities
can return two results when pronunciation can be ambiguous
Metaphone 3 (2009) is some further tweaking


  • EDITEX: distance metric comparable to levenshtein with some metaphone-like processing (using character equivalence groups)
    • J Zobel, P Dart, Phonetic string matching: Lessons from information retrieval (1996)
  • ALINE distance (See Kondrak 2000, A New Algorithm For The Alignment Of Phonetic Sequences)
  • CORDI (See Kondrak 2002, Determining recurrent sound correspondences by inducing translation models)
the first version (from 2002) was specifically for names
the second version (from 2004) is more generally applicable



Note that there are some further footnotes beyond 'use something like Levenshtein, I guess'.

Say, tapered weighting (weighting differences in the start / first quarter / half of the string more heavily) is sometimes effective for certain languages, e.g. those that mark inflection primarily with suffix morphemes.
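
Putting the two steps together - map to a phonetic code, then compare the codes - here is a sketch assuming the third-party jellyfish library is installed (any Soundex/Metaphone implementation would work the same way):

 import jellyfish   # assumption: pip install jellyfish
 
 for a, b in [("Smith", "Smyth"), ("Catherine", "Kathryn")]:
     print(a, b,
           jellyfish.levenshtein_distance(a, b),           # raw orthographic distance
           jellyfish.soundex(a), jellyfish.soundex(b),     # coarse phonetic codes
           jellyfish.metaphone(a), jellyfish.metaphone(b))
     # comparing the phonetic codes (for equality, or with Levenshtein on the codes)
     # catches sound-alike pairs that plain edit distance rates as fairly different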



Metaphone notes

Metaphone is a reduction of text into consonant phones, aware of a specific language. Vowels are not represented in the output, but they are used in the transform.

Initially made for names, for features like 'similar spellings', 'possible misspellings'.


The original metaphone implementation was written for English, but similar transforms have been written for some other languages.


English:

Characters are consonant sounds (this may not be the full list (verify)):

  • B
  • X ('zj' sound, as in location, fish)
  • S ('s' and 'z' sound, as in science, zoot and most regular appearances of S)
  • K (as in school, query, cookie)
  • J (as in junior, edge)
  • T ('t' and 'd' sound, as in tomb and dumb)
  • F (as in e.g. phone, fish)
  • H
  • L
  • M
  • N
  • P
  • R
  • 0 (as in thumb) (note: a zero, not the letter O)
  • W
  • Y

Rules consider simple contextual changes and silent letters.

There is no meaning to capitalization. Some systems only generate lowercase, some only uppercase metaphone strings.


Double metaphone

Double metaphone is a mild expansion of basic metaphone that can give two output strings. It is meant to allow more flexibility for alternative pronunciations, apparently designed with foreign-but-fairly-regular pronunciations of names in mind.

For example:

"Smith" -> (SM0, XMT)
"Schmidt" (XMT, SMT)


See also


Unsorted (string comparison)

http://www.morfoedro.it/doc.php?n=223&lang=en

http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#tanimoto

Coding / annotator / rater consistency

Cohen's kappa

https://en.wikipedia.org/wiki/Cohen%27s_kappa
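
A minimal sketch from first principles: observed agreement, corrected for the agreement you would expect by chance from each annotator's label frequencies (plain Python; the labels are made up):

 from collections import Counter
 
 def cohens_kappa(labels_a, labels_b):
     # kappa = (observed agreement - expected agreement) / (1 - expected agreement)
     n = len(labels_a)
     observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
     count_a, count_b = Counter(labels_a), Counter(labels_b)
     expected = sum(count_a[c] * count_b[c]
                    for c in set(labels_a) | set(labels_b)) / (n * n)
     return (observed - expected) / (1 - expected)
 
 a = ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no']
 b = ['yes', 'no',  'no', 'yes', 'no', 'yes', 'yes', 'no']
 print(cohens_kappa(a, b))   # 0.5 -- raw agreement is 75%, chance level is 50%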


Fleiss' kappa

https://en.wikipedia.org/wiki/Fleiss%27_kappa


Krippendorff's alpha

https://en.wikipedia.org/wiki/Krippendorff%27s_alpha


Concordance correlation coefficient

https://en.wikipedia.org/wiki/Concordance_correlation_coefficient



Larger / semantic text comparison

Semantic similarity is the broad area of metrics between words, phrases, and/or documents.

It might be anything that gives you distances between those, based on any sort of estimate of their likeness.

Methods include:


Semantic similarity based on ontologies

Focusing on known words, and on "is a" knowledge and other ontological relationships (see also lexical databases)

Say, where many methods may put 'car', 'road', and 'driving' in the same area, this may also say roughly how they are related.
It often gives cleaner distance results, except that the knowledge base has to be extensive to not miss things, particularly on unseen topics.


...and in this context, a distinction is sometimes made with semantic relatedness, which widens from is-a to also include antonyms (opposites), meronyms (part-whole relations), and hyponyms/hypernyms.


Semantic similarity based on other statistics/probability

A lot of estimations rely on "seems to co-occur more than we expect if it weren't related".


There are more statistical estimations of such ontological relationships

a middle point on the scale of false negatives and false positives
(and sometimes a way of refining a knowledge base more quickly)


Semantic similarity based on vector space representations

See embeddings, but the tl;dr is that vector space representations, as applied to words, use a bulk of text to assign a vector to each word, which you can then meaningfully compare to any other word's vector.

Exactly how well that works depends on a lot of things. There is an upper limit, though there are many useful things you can do.


Word Mover's Distance

Unsorted

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Various measures

  • Cosine correlation
  • Back-off method (Katz 1987) (?)
  • Distance weighted averaging
  • Confusion probability (approximation of KL?)

...and more.


Bhattacharyya



Alpha skew (Sα) divergence

See also:


Chi-squared distance

Confusion probability (τα)

CY dissimilarity

Gower dissimilarity

Jiang's & Conranth similarity, Jiang's & Conranth distance

See also:


Kulczynski dissimilarity/distance

a.k.a. Kulsinski ?


Lin's Similarity Measure

See also:

  • Lin, D. (1998) An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning, San Francisco, CA, pp. 296–304.


Mahalanobis distance

McArdle-Gower dissimilarity

Orloci's chord distance

Resnik's similarity measure

See also:

  • P. Resnik (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy

Rogers-Tanimoto dissimilarity

Russell-Rao dissimilarity

Sokal-Michener dissimilarity

Sokal-Sneath dissimilarity

Sørensen, Dice, Bray-Curtis, etc.

Names:

  • Sørensen similarity index
  • Dice's coefficient
  • Czekanowski index
  • Hellinger distance
  • Bray-Curtis dissimilarity

...are all related, and some are identical under certain restrictions or when applied to certain types of data.

See also


Yule dissimilarity

Combinations

Back-off (e.g. Katz') vs. averaging

Media

General comparison

  • MAE - Mean Absolute Error
  • MSE - Mean Squared Error
  • PSE -
  • PSNR - Peak Signal-to-Noise Ratio
  • RMSE - Root Mean Squared Error
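
A quick sketch of how these relate (plain Python; values assumed to be 8-bit samples in 0..255, so PSNR uses 255 as the peak value):

 import math
 
 def mae(x, y):
     return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
 
 def mse(x, y):
     return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
 
 def rmse(x, y):
     return math.sqrt(mse(x, y))
 
 def psnr(x, y, max_value=255):
     # peak signal-to-noise ratio in dB; higher means more similar
     return 10 * math.log10(max_value ** 2 / mse(x, y))
 
 original  = [52, 55, 61, 59, 79]
 distorted = [54, 55, 60, 62, 76]
 print(mae(original, distorted), rmse(original, distorted), psnr(original, distorted))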




Others

Consistency

Cronbach alpha

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Cronbach alpha is a "measure of the reliability of a psychometric instrument" (wikipedia) by, roughly speaking, comparing the consistency of individual results against the rest.


The resulting value (usually between 0.0 and 1.0) can tell you how consistent the data is, which also implies how viable/reliable a particular run of an analysis is.

A low Cronbach alpha indicates the data does not seem to have the structure the analysis is looking for, and/or that the result quality is limited. This may be caused by various things, often by example data not fitting into your model or the noise level being high, possibly because your data is generally sparse (and thereby this test is too), or even simply that you are measuring something without structure.


Social sciences generally deem below 0.7 as unreliable, anything above that as good enough, and above 0.9 as better - though that varies somewhat with the expected noise in the specific dataset.
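
A minimal sketch of the usual formula - (k/(k-1)) · (1 - sum of per-item variances / variance of the total scores) - on made-up data:

 def cronbach_alpha(item_scores):
     # item_scores: one list per item, each holding one score per respondent
     def variance(xs):
         m = sum(xs) / len(xs)
         return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)   # sample variance
 
     k = len(item_scores)
     totals = [sum(scores) for scores in zip(*item_scores)]     # per-respondent totals
     item_var_sum = sum(variance(scores) for scores in item_scores)
     return (k / (k - 1)) * (1 - item_var_sum / variance(totals))
 
 # three items answered by four respondents, deliberately correlated
 items = [[1, 2, 4, 5],
          [2, 2, 5, 4],
          [1, 3, 4, 5]]
 print(cronbach_alpha(items))   # ≈ 0.95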


See also:


See also

  • Lee, Measures of Distributional Similarity (1999), [13]
  • Weeds, Weir, McCarthy, Characterising Measures of Lexical Distributional Similarity, [14]


  • CE Shannon (1948) A Mathematical Theory of Communication. [15]



Various:

  • Kruskal - An overview of sequence comparison: time warps, string edits, and macromolecules