Power law

From Helpful
Revision as of 15:18, 23 July 2021 by Helpful (Talk | contribs)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

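Power laws are often named when someone wishes to indicate certain statistical properties, typically that one quantity varies as a power of another, so that a few items account for most of the total while the many remaining items each contribute very little.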

Commonly named power laws, or power-law distributions, include the Zipf distribution, the zeta distribution, the Pareto distribution, and a few others.

See also http://en.wikipedia.org/wiki/Power_law

Zipf's law, Zipfian word distributions

Linguists often specifically name Zipf's law, which refers to the observation that a word's frequency in a text is (roughly) inversely proportional to its rank in the frequency-sorted word list.

There are various ways to graph this, see e.g. the graphs in the various references.
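One common way to graph this is frequency against rank on log-log axes, where an ideal Zipfian text falls on a straight line with a slope near -1. A minimal sketch of getting those numbers without any plotting library follows. The function names and the crude tokeniser are assumptions for illustration, not something taken from the references.

```python
from collections import Counter
import math
import re


def rank_frequency(text):
    """Return [(rank, count), ...] from the most to the least frequent word."""
    # Crude tokeniser (an assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank)."""
    if len(pairs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Usage: feed it any reasonably long text and look at the slope.
# pairs = rank_frequency(open("some_text.txt").read())  # hypothetical file
# print(loglog_slope(pairs))
```

On real text the slope is only roughly -1, and the very top and the long tail of the list tend to bend away from the line.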

One implication is that the top so-many terms in a text account for the bulk of that text. In text corpora, half of all word use is often covered by the ~100-200 most frequent words. In languages that have function words, those are likely to take most or all places in the top ten.
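To check the coverage claim on your own text, something like the following would do. It is a sketch under assumptions: the helper name `top_n_coverage` and the simple regex tokeniser are made up here, and a real tokeniser would change the exact numbers.

```python
from collections import Counter
import re


def top_n_coverage(text, n):
    """Fraction of all word tokens covered by the n most frequent words."""
    # Crude tokeniser (an assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(words)


sample = "the cat and the dog and the bird saw the cat"
print(top_n_coverage(sample, 2))  # share of tokens taken by the top 2 words
```

With a large enough corpus you can raise n until the result passes 0.5 and see how many words that takes.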

For example, you may find that in some English document:

  • the (rank 1) occurs as ~7% of all words
  • of (rank 2) occurs ~3.5%
  • and (rank 3) occurs ~2.9%
  • from that you can estimate that something at rank 10 would occur ~0.7% (roughly the rank-1 share divided by the rank)
  • something at rank 1000 would occur ~0.007%
  • ...etc.
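The estimates in the list above amount to dividing the rank-1 share by the rank. A minimal sketch, assuming the idealised case where the exponent is exactly 1 (the function and parameter names are made up for illustration):

```python
def zipf_share(rank, top_share=0.07, exponent=1.0):
    """Estimated share of all tokens taken by the word at the given rank.

    top_share is the share of the rank-1 word (~7% for "the" in the
    example above); exponent=1.0 is the idealised Zipf case (an assumption).
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_share / rank ** exponent


for r in (1, 2, 3, 10, 1000):
    print(f"rank {r:>4}: ~{zipf_share(r) * 100:.3g}%")
```

Note that this gives ~2.3% for rank 3, where the observed figure above is ~2.9%, which is a reminder that real text follows the law only roughly.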

Larger things seen as units, such as whole sentences, also roughly follow a Zipfian distribution. For example, if you analyse emails or chat logs, or just people interacting, the most frequent sentences consist largely of formalities, interjections, responses, and (other) daily social interaction.

See also