The principle that the frequency of the rth most common word or phrase in a
relatively lengthy text (or in any natural language) is approximately
proportional to 1/r, where r is its rank by frequency. This means that the 10th most
frequent word will be used about twice as often as the 20th most frequent word,
and ten times as often as the 100th most frequent word. Another way of
stating Zipf's Law is that the frequency $P_r$ of the rth most common word or
phrase satisfies $P_r \propto 1/r^a$, with the exponent a close to 1 and for r up to
about 1000 (the relationship breaks down for less commonly used words). Based on the observations
of Harvard linguist George Kingsley Zipf (1902-1950), the relationship can also
be stated in the equation r × f = k, where r is the rank of the word, f is its
frequency, and k is a constant. Illustrating his point with an analysis of the
text of James Joyce's Ulysses, Zipf found that the 10th most frequent word was
used 2,653 times, the 100th most common word was used 265 times, and so on,
yielding a constant of approximately 26,500. Although Zipf's Law is an empirical
approximation rather than an exact statistical predictor, indexers find it helpful.
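The rank-times-frequency relationship can be checked with a few lines of Python. The following is a minimal sketch, not a standard tool: the function names (rank_frequency, zipf_constant, predicted_frequency) and the simple word-splitting rule are assumptions made for illustration. Only the Ulysses counts come from the entry itself.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first.

    Assumption: a "word" is any run of letters or apostrophes,
    compared case-insensitively. Real indexing tools may tokenize differently.
    """
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_constant(rank, frequency):
    """Compute k in r * f = k for one observed (rank, frequency) pair."""
    return rank * frequency


def predicted_frequency(k, rank, a=1.0):
    """Estimate the frequency at a given rank as k / rank**a (a close to 1)."""
    return k / rank ** a


if __name__ == "__main__":
    # The Ulysses figures quoted in the entry: (rank, observed count).
    for rank, count in [(10, 2653), (100, 265)]:
        print(rank, count, zipf_constant(rank, count))
    # prints:
    # 10 2653 26530
    # 100 265 26500

    # With k of about 26,500, the estimate for the 20th word is half
    # the count of the 10th, as the entry describes.
    print(predicted_frequency(26500, 20))  # prints 1325.0
```

To try this on a real text, pass its contents to rank_frequency and compare zipf_constant across ranks. The products should stay roughly constant for the commoner words and drift for the rarer ones.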
Source:
IGNOU Study Materials
Wikipedia