Latest revision as of 15:18, 23 July 2021

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Power laws are often invoked when someone wishes to point out certain statistical properties, such as the heavy tail of a distribution.

Commonly named power laws, or power distributions, include the Zipf distribution, the zeta distribution, the Pareto distribution, and a few others.

See also http://en.wikipedia.org/wiki/Power_law



Zipf's law, Zipfian word distributions

Linguists often specifically name Zipf's law, which refers to the observation that a word's frequency in a text is (roughly) inversely proportional to its frequency rank.

There are various ways to graph this, see e.g. the graphs in the various references.


This has implications such as that the top so-many terms in some text account for the bulk of that text. In corpora of text, half of all word use is often covered by the top ~100-200 words that occur in that text. In languages that have function words, those are likely to take most or all places in the top ten.
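That coverage claim is easy to check on any corpus. A minimal sketch (the function name and the whitespace tokenization are illustrative choices, not a standard API):

```python
from collections import Counter

def top_words_covering(text, share=0.5):
    """Return how many of the most frequent words together account
    for at least the given share of all word occurrences."""
    tokens = text.lower().split()  # crude tokenization, fine for a sketch
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for n, (word, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= share:
            return n
    return len(counts)
```

Run on a reasonably sized English text, this tends to return a number in the low hundreds, matching the ~100-200 figure above.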

For example, you may find that in some English document

  • the (rank 1) occurs as ~7% of all words
  • of (rank 2) occurs at ~3.5%
  • and (rank 3) occurs at ~2.9%
  • you can estimate that something at rank 10 would occur at ~0.7%
  • ...and something at rank 1000 at ~0.007%
  • ...etc.
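The estimates in that list follow directly from the inverse-proportionality rule, anchored at the observed frequency of the top word (~7% here, taken from the example above):

```python
def zipf_frequency(rank, top_frequency=0.07):
    """Rough Zipfian estimate: a word's frequency is inversely
    proportional to its rank, scaled by the rank-1 word's frequency."""
    return top_frequency / rank

# rank 2 -> 0.035 (~3.5%), rank 10 -> 0.007 (~0.7%), rank 1000 -> 0.00007 (~0.007%)
```

Note this is only the crudest form of the law; real rank-frequency curves deviate from it, particularly at the head and tail.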


Larger things seen as units also roughly follow a Zipfian distribution. For example, if you analyse emails, chat logs, or other records of people interacting, the top sentences consist largely of formalities, interjections, responses, and (other) daily social interaction.

See also