I'm a computer programmer with a bit of rusty maths under my belt, but I'm now returning to education and was hoping for a little help getting to grips with information entropy. Specifically, I'm trying to understand this formula:

$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$

As I understand it, this is a measure of the information entropy of a random variable X with probability mass function p(x), or, said differently, the average number of bits needed to describe the random variable.
I understand the elements that make up this formula, but I don't see how it is the average number of bits needed to describe the random variable.
Let's take, for example, the set [a, b, c, d, e, f, g, h] with probabilities [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]. The formula gives the information entropy as 2 bits, meaning that on average you will need 2 bits to represent which element of the set has been chosen, correct?
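(For what it's worth, here's the quick Python check I used to get that 2-bit figure, just evaluating the entropy sum directly:)

```python
from math import log2

# Probabilities for the eight symbols a..h
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# H(X) = -sum of p(x) * log2(p(x)) over all symbols
entropy = -sum(p * log2(p) for p in probs)
print(entropy)  # → 2.0
```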
If we assign these bit sequences to the set to describe each element: [0, 1, 10, 11, 100, 101, 110, 111], then we can calculate:
0.75 of the time, we'll need 1 bit:
0.75 * 1 = 0.75
Then the same logic for how often we'll need 2 and 3 bits:
0.1875 * 2 = 0.375
0.0625 * 3 = 0.1875
Add the 3 values: 0.75 + 0.375 + 0.1875 = 1.3125 bits on average.
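(Again for reference, here's the same calculation in Python — just the probability of each symbol weighted by the length of the bit sequence I assigned to it:)

```python
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "1", "10", "11", "100", "101", "110", "111"]

# Average bits = sum of p(x) * len(code for x) over all symbols
avg_bits = sum(p * len(c) for p, c in zip(probs, codes))
print(avg_bits)  # → 1.3125
```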
I don't understand why my method gives a different answer and can only assume I am misunderstanding how exactly information entropy is measured.
As a side note, which may be the point I'm losing this on, I can see that −log₂ p(x) gives how many bits I would need to describe x for a uniform distribution, but I can't see how it is relevant for a non-uniform distribution.
−log₂(1/4) = 2 bits, which makes perfect sense if I had 4 options, each with equal probability, but when the distribution isn't uniform, how is it OK to use this?
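(Concretely, here's what that −log₂ p quantity works out to for each of the probabilities in my example above:)

```python
from math import log2

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# -log2(p) for each symbol's probability
bits = [-log2(p) for p in probs]
print(bits)  # → [1.0, 2.0, 3.0, 4.0, 6.0, 6.0, 6.0, 6.0]
```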
Many thanks for helping me dust the rust off!