
Math Help - Query about Information Entropy formula

  1. #1
    Newbie
    Joined
    May 2007
    Posts
    11

    Query about Information Entropy formula

    I'm a computer programmer with a bit of rusty maths under my belt, but am now returning to education and was hoping for a little help getting to grips with Information Entropy. Specifically, I'm trying to understand this formula:

    H(X) = -∑_x p(x) log2(p(x))

    As I understand it, this is the information entropy of a random variable X with probability mass function p(x), or, said differently, the average number of bits needed to describe the random variable.

    I understand the elements that make up this formula, but I don't see how it is the average number of bits needed to describe the random variable.

    Let's take, for example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64]. The formula gives the information entropy as 2 bits; meaning that on average you will need 2 bits to represent which of the set has been chosen, correct?
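
    To make that concrete, here is a quick Python check of the calculation (just my own sketch of what the formula computes):

        from math import log2

        # probabilities for the symbols a..h
        probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

        # H(X) = -sum of p(x) * log2(p(x))
        entropy = -sum(p * log2(p) for p in probs)
        print(entropy)  # 2.0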

    If we assign these bit sequences to the set to describe each variable: [0,1,10,11,100,101,110,111], then we can calculate:

    0.75 of the time, we'll need 1 bit:

    0.75 * 1 = 0.75

    Then the same logic for how often we'll need 2 and 3 bits:

    0.1875 * 2 = 0.375
    0.0625 * 3 = 0.1875

    Add the 3 values: 0.75 + 0.375 + 0.1875 = 1.3125 bits on average.
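
    The same arithmetic in Python (again just a sketch), so you can see where my 1.3125 comes from:

        # probabilities for a..h and the bit sequences I assigned above
        probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
        code = ['0', '1', '10', '11', '100', '101', '110', '111']

        average_bits = sum(p * len(word) for p, word in zip(probs, code))
        print(average_bits)  # 1.3125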

    I don't understand why my method gives a different answer and can only assume I am misunderstanding how exactly information entropy is measured.



    As a sidenote, which may be the point I'm losing this on, I can understand that -log2(p(x)) gives how many bits I would need to describe X for a uniform distribution, but I can't see how it is relevant for a non-uniform distribution.

    For example,

    -log2(1/4) = 2 bits, which makes perfect sense if I had 4 options each with an equal probability, but when the distribution is not uniform, how is it OK to use this?

    Many thanks for helping me dust the rust off!

  2. #2
    Senior Member
    Joined
    Apr 2006
    Posts
    399
    Awards
    1
    Quote Originally Posted by TigerTom View Post
    I'm a computer programmer with a bit of rusty maths under my belt, but am now returning to education and was hoping for a little help getting to grips with Information Entropy. Specifically, I'm trying to understand this formula:

    H(X) = -∑_x p(x) log2(p(x))

    [...]

    As a sidenote, which may be the point I'm losing this on, I can understand that -log2(p(x)) gives how many bits I would need to describe X for a uniform distribution, but I can't see how it is relevant for a non-uniform distribution.

    For example,

    -log2(1/4) = 2 bits, which makes perfect sense if I had 4 options each with an equal probability, but when the distribution is not uniform, how is it OK to use this?
    Hello. It looks like you're referring to Information entropy - Wikipedia, the free encyclopedia, which would be helpful to say.

    I think your sidenote and example explain what's going on. If p(X) = 1/4 for a uniform distribution, -log2(1/4) = 2 bits are required to encode the 4 possible values of X. So the assumption is that 2 bits are always required to encode an X value with p(X) = 1/4. But that's not a direct assumption. Rather, it follows from the assumptions used to derive the formula: Continuity, Symmetry, Maximum and Additivity (these are listed in the above link). Specifically, I think Additivity drives this. I'd say that if you want to add up subsystems, you have to treat X values with equal p(X) the same. See Entropy encoding - Wikipedia, the free encyclopedia, which is linked from the above article.
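
    One way to make the Additivity point concrete for the p(X) = 1/4 case (my own illustration, not from the article): a uniform choice among 4 values is the same as two independent fair-coin choices, and the entropies of independent choices add, so

    H(1/4, 1/4, 1/4, 1/4) = H(1/2, 1/2) + H(1/2, 1/2) = 1 + 1 = 2 bits,

    which matches -log2(1/4) = 2 bits per value.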

  3. #3
    Grand Panjandrum
    Joined
    Nov 2005
    From
    someplace
    Posts
    14,972
    Thanks
    4
    Quote Originally Posted by TigerTom View Post
    Let's take, for example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64]. The formula gives the information entropy as 2 bits; meaning that on average you will need 2 bits to represent which of the set has been chosen, correct?
    That checks out.

    If we assign these bit sequences to the set to describe each variable: [0,1,10,11,100,101,110,111], then we can calculate:
    This is not an admissible code, as knowledge of the code and the code-text is not sufficient to recover the source-text. Consider the message 111100: it could have been generated from "he" or from "ddaa".

    Now consider the code (maybe not the most efficient but admissible):

    [0, 10, 110, 1110, 11110, 111110, 1111110, 11111110]

    The average number of bits per character is now 2.03125 >= 2, as required.
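
    A quick Python check of both claims (just a sketch):

        # probabilities for a..h
        probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

        original = ['0', '1', '10', '11', '100', '101', '110', '111']
        admissible = ['0', '10', '110', '1110', '11110', '111110', '1111110', '11111110']

        # the original code is ambiguous: two different messages give the same bits
        enc = dict(zip('abcdefgh', original))
        assert enc['h'] + enc['e'] == '111100'
        assert enc['d'] * 2 + enc['a'] * 2 == '111100'

        # average bits per character for the admissible code
        print(sum(p * len(word) for p, word in zip(probs, admissible)))  # 2.03125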

    RonL

  4. #4
    Senior Member
    Joined
    Apr 2006
    Posts
    399
    Awards
    1
    Quote Originally Posted by CaptainBlack View Post
    This is not an admissible code, as knowledge of the code and the code-text is not sufficient to recover the source-text. Consider the message 111100: it could have been generated from "he" or from "ddaa".

    Now consider the code (maybe not the most efficient but admissible):

    [0, 10, 110, 1110, 11110, 111110, 1111110, 11111110]

    The average number of bits per character is now 2.03125 >= 2, as required.

    RonL
    I think I get what you're saying. Consider the following example, which I think is the simplest.

    Suppose the characters are A, B and C with probabilities 1/2, 1/4, 1/4. According to entropy encoding, log2(2) = 1 bit is needed to encode A, and log2(4) = 2 bits are needed to encode B and C.

    To achieve this requires the use of a prefix-free code to tell you whether a 1-bit or a 2-bit code is being used. Prefix-free means the code for any character is never a prefix of the code for any other character. So use A = 0, B = 10 and C = 11. Then 'ABC' = 01011. To decode, if a 0 is read, the character is A. If a 1 is read, the next bit is 0 for B or 1 for C. This encoding is optimal.

    If we tried to use fewer bits in the encoding, such as A = 0, B = 1, C = 00, the code for A would be a prefix of the code for C, and there would be confusion: 'ABC' = 0100 could also be read as 'ABAA'. A separator such as "," would then be needed to separate the codes, so 'ABC' = 0,1,00, requiring extra bits for the separator.
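
    A small Python sketch of that decoding rule (my own illustration, not anyone's production code):

        # prefix-free code: A = 0, B = 10, C = 11
        code = {'0': 'A', '10': 'B', '11': 'C'}

        def decode(bits):
            out, buffer = [], ''
            for b in bits:
                buffer += b
                if buffer in code:            # no code word is a prefix of another,
                    out.append(code[buffer])  # so a match can be emitted at once
                    buffer = ''
            return ''.join(out)

        print(decode('01011'))  # ABC

        # with A = 0, B = 1, C = 00 the bits 0100 could be 'ABC' or 'ABAA',
        # so that code cannot be decoded without separators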

    The code given by CaptainBlack (the unary encoding) uses 0 as the separator. Here that encoding would be A = 0, B = 10, C = 110. It is not optimal.

    If you have a uniform distribution on the characters, every code has the same length and you don't have to worry about prefixes.

    It is interesting that this logic falls out of the entropy formula.

    PS: Going back to TigerTom's original example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64], a prefix-free encoding for this would be

    [0, 10, 110, 1110, 111100, 111101, 111110, 111111]

    which uses the optimal average of 2 bits compared to the 2.03125 bits of the unary encoding given by CaptainBlack:

    [0, 10, 110, 1110, 11110, 111110, 1111110, 11111110]
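
    A quick Python check that the first code really is prefix-free and averages exactly 2 bits (sketch only):

        probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
        code = ['0', '10', '110', '1110', '111100', '111101', '111110', '111111']

        # no code word is a prefix of another
        assert not any(a != b and b.startswith(a) for a in code for b in code)

        print(sum(p * len(word) for p, word in zip(probs, code)))  # 2.0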
    Last edited by JakeD; May 29th 2007 at 12:41 AM.

  5. #5
    Newbie
    Joined
    May 2007
    Posts
    11

    Thanks - beginning to make a lot more sense...

    Quote Originally Posted by JakeD View Post
    Hello. It looks like you're referring to Information entropy - Wikipedia, the free encyclopedia, which would be helpful to say.
    Ok, thanks

    Quote Originally Posted by CaptainBlack View Post
    This is not an admissible code, as knowledge of the code and the code-text is not sufficient to recover the source-text. Consider the message 111100: it could have been generated from "he" or from "ddaa".
    Ahhh, ok. After reading the links about entropy encoding and prefix codes, this is beginning to make sense.

    Quote Originally Posted by CaptainBlack View Post
    Now consider the code (maybe not the most efficient but admissible):

    [0, 10, 110, 1110, 11110, 111110, 1111110, 11111110]

    The average number of bits per character is now 2.03125 >= 2, as required.
    The book I am reading gives this code: [0, 10, 110, 1110, 111100, 111101, 111110, 111111]. When I first read it, I had no clue how it was reached, or how they decided it meant an average description length of 2 bits.

    Now I can calculate the same answer myself, and I can see that the bit lengths match -log2(p(x)) for each probability, which is the optimal code length according to Shannon's source coding theorem.
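
    Checking that in Python (my own quick sketch):

        from math import log2

        probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
        code = ['0', '10', '110', '1110', '111100', '111101', '111110', '111111']

        # each code-word length equals -log2(p(x)), so the average length
        # equals the entropy: 2 bits
        for p, word in zip(probs, code):
            assert len(word) == -log2(p)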

    Bits and pieces are starting to fit together!

    Quote Originally Posted by JakeD View Post
    PS: Going back to TigerTom's original example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64], a prefix-free encoding for this would be

    [0, 10, 110, 1110, 111100, 111101, 111110, 111111]

    which uses the optimal average of 2 bits compared to the 2.03125 bits of the unary encoding given by CaptainBlack
    Yes - the same as is used in my book.


    Thanks so much for helping me along the way. I've not had to do any heavy or new maths for 7 years, since high school, so diving back in has left me with so many questions that it's hard to tie things together. Both of you have really helped.
