See Information entropy - Wikipedia, the free encyclopedia, which would be helpful to cite here.
I think your sidenote and example explain what's going on. If p(X) = 1/4 for a uniform distribution over 4 values, then -log2(1/4) = 2 bits are required to encode each of the 4 possible values of X. So the assumption is that 2 bits are always required to encode an X value with p(X) = 1/4. But that's not a direct assumption. Rather, it follows from the assumptions used to derive the formula: Continuity, Symmetry, Maximum and Additivity (these are listed in the above link). Specifically, I think Additivity drives this: if you want to add up the entropies of subsystems, you have to treat X values with equal p(X) the same. See Entropy encoding - Wikipedia, the free encyclopedia, which is linked from the above article.
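Just to make the arithmetic concrete, here's a quick Python sketch of the two quantities being discussed: the code length -log2(p) for a single outcome, and the Shannon entropy of the uniform distribution over 4 values (which comes out to the same 2 bits):

```python
from math import log2

# Code length for one outcome with probability p = 1/4: -log2(p) bits.
p = 1 / 4
bits = -log2(p)
print(bits)  # 2.0

# Shannon entropy of a uniform distribution over 4 values:
# H = -sum(p_i * log2(p_i)) = log2(4) = 2 bits per symbol.
probs = [1 / 4] * 4
H = -sum(pi * log2(pi) for pi in probs)
print(H)  # 2.0
```

For the uniform case every outcome has the same probability, so the per-outcome code length and the average (the entropy) coincide; with unequal probabilities they would differ.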