Zipf's Law for Random Texts

Feb 2015
Sylmar, CA
we let there be M letters and one space that our random monkey can type, each with probability
A word is formed when a string of letters of any size is preceded and followed by a space, denoted by _ here. The probability of getting the word _a_ is p3, and that of an L-letter word, pL+2 . So the frequency of occurrence of any word with length L is
where c is a normalization constant. There are ML such words with lenght L. The constant c is to be determined by the convention that the frequency of occurrence of all words is normalized to 1:
The frequency of occurrence of all words of length L is
a. Show that
b. To convert word lenth L into rank, we note that the formula for fi implies that shorter words (those with smaller values of L) should occur more frequently and hence have a lower value for the rank k(L). The rank is easy to understand (the most frequently occurring word is of rank 1, the next is rank 2, etc.) but it is difficult to define mathematically. Nevertheless, argue that the rank can be exressed as
, for some constants
. This yields, upon taking the log with base M:
. Substituting this expression for L into the formula in (a), show that f satisfies the following generalized Zipf's Law:
. Show that
(hint:you need to use the technique of change of base for the logarithmic function so that you can show that f satisfies the generalized Zipf's law)