I'm a computer programmer with a bit of rusty maths under my belt, but I'm now returning to education and was hoping for a little help getting to grips with information entropy. Specifically, I'm trying to understand this formula:

H(X) = −∑ p(x) log₂(p(x))

which, as I understand it, is a measure of the information entropy of a random variable X with probability mass function p(x), or said differently, the average number of bits needed to describe the random variable.

I understand the elements that make up this formula, but I don't see how it is the average number of bits needed to describe the random variable.

Let's take, for example, the set {a, b, c, d, e, f, g, h} with probabilities [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]. The formula gives the information entropy as

**2 bits**; meaning that, on average, you will need 2 bits to represent which element of the set has been chosen, correct?
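To check my arithmetic, here's a small Python sketch that evaluates the entropy formula directly on this distribution (names are my own, nothing standard assumed):

```python
import math

# The example distribution: probabilities for {a, b, c, d, e, f, g, h}.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# H(X) = -sum over x of p(x) * log2(p(x))
entropy = -sum(p * math.log2(p) for p in probs)
print(entropy)  # 2.0
```

Every probability here is a power of two, so the terms come out exact and the sum is exactly 2.0.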

If we assign these bit sequences to the set to describe each outcome: [0, 1, 10, 11, 100, 101, 110, 111], then we can calculate:

0.75 of the time, we'll need 1 bit:

0.75 * 1 = 0.75

Then the same logic for how often we'll need 2 and 3 bits:

0.1875 * 2 = 0.375

0.0625 * 3 = 0.1875

Add the 3 values: 0.75 + 0.375 + 0.1875 =

**1.3125 bits on average**.
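Here's the same averaging written out as a sketch in Python, using the bit-strings I assigned above (this is just my calculation made explicit, not anything from a textbook):

```python
# My probabilities and my chosen bit-strings for {a, ..., h}.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ['0', '1', '10', '11', '100', '101', '110', '111']

# Expected code length: sum of p(x) * len(code for x).
avg_len = sum(p * len(c) for p, c in zip(probs, codes))
print(avg_len)  # 1.3125
```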

I don't understand why my method gives a different answer and can only assume I am misunderstanding how exactly information entropy is measured.

As a sidenote, which may be the point I'm losing this on: I can understand that −log₂(p(x)) gives how many bits I would need to describe x under a uniform distribution, but I can't see how it is relevant for a non-uniform distribution.

For example,

−log₂(1/4) = 2 bits, which makes perfect sense if I had 4 options each with equal probability, but when the distribution is non-uniform, how is it OK to use this?
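A quick sketch of how I'm reading −log₂(p) as a per-outcome bit count (my own interpretation, applied to the probabilities from my example):

```python
import math

# -log2(p) for each distinct probability in the example distribution.
# E.g. -log2(1/4) = 2, matching "2 bits for 4 equally likely options".
bits = {p: -math.log2(p) for p in [1/2, 1/4, 1/8, 1/16, 1/64]}
print(bits)
```

The values come out as 1, 2, 3, 4 and 6 bits respectively.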

Many thanks for helping me dust the rust off!