I'm having a lot of trouble with something that *should* be quite simple to understand. I have spent many hours searching for how to apply mutual information to my problem. Essentially, I want to use mutual information as a means of feature selection.

So presume I have a data set of samples of a 2D function: each sample has two input variables, X1 and X2, and a Y value that is the function's output for those two inputs. Assume I choose the samples using Latin hypercube sampling, with the inputs bounded between -1.0 and 1.0. I want to decide which input variable, X1 or X2, is more important in determining Y, using mutual information.
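To make the setup concrete, here is a minimal sketch of how such a data set could be generated. The function `toy_function` is a hypothetical stand-in (the real function is not specified here), and `latin_hypercube` is a simple hand-rolled sampler written for illustration, not a library routine.

```python
import random

def latin_hypercube(n_samples, n_dims, low=-1.0, high=1.0, seed=0):
    """Return n_samples points in [low, high]^n_dims, one per stratum in each dimension."""
    rng = random.Random(seed)
    width = (high - low) / n_samples
    columns = []
    for _ in range(n_dims):
        # One random point inside each of the n_samples equal-width strata.
        col = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(col)  # decouple the strata across dimensions
        columns.append(col)
    # Transpose the columns into a list of (x1, x2, ...) samples.
    return [tuple(col[k] for col in columns) for k in range(n_samples)]

def toy_function(x1, x2):
    # Hypothetical example only: depends strongly on x1 and weakly on x2.
    return x1 ** 2 + 0.1 * x2

samples = latin_hypercube(200, 2)
ys = [toy_function(x1, x2) for x1, x2 in samples]
```

Each input column then covers [-1, 1] evenly, with exactly one sample in each of the 200 equal-width intervals.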

Now, I assume I need the mutual information for continuous random variables, computed once for I(X1;Y) and once for I(X2;Y) (using the general notation I've seen). So let's say, for example, I want to find the mutual information score I(X1;Y). I'm very stuck on how to do this.

The formula I've seen is:

$$I(X_1;Y) = \int\!\!\int f(x,y)\,\log\frac{f(x,y)}{f(x)\,f(y)}\,dx\,dy$$

where f(x,y) is the joint p.d.f. of X1 and Y, and f(x) and f(y) are the marginal p.d.f.s of X1 and Y respectively. Assume the log base is 2.
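One common way to approximate this integral from samples, sketched here under assumptions, is to replace the densities with histogram frequencies: bin X and Y, count how often each pair of bins occurs, and sum over the occupied cells. The function name `binned_mutual_information`, the choice of 10 equal-width bins, and the test function `x1**2 + 0.1*x2` are all illustrative assumptions, and plain uniform sampling is used instead of Latin hypercube sampling to keep the sketch short.

```python
import math
import random
from collections import Counter

def binned_mutual_information(xs, ys, n_bins=10):
    """Plug-in estimate of I(X;Y) in bits using equal-width histogram bins."""
    def bin_index(v, lo, hi):
        if hi == lo:
            return 0  # a constant variable falls into a single bin
        # Clamp so that the maximum value lands in the last bin.
        return min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    bx = [bin_index(x, x_lo, x_hi) for x in xs]
    by = [bin_index(y, y_lo, y_hi) for y in ys]

    n = len(xs)
    joint = Counter(zip(bx, by))  # counts for the joint distribution
    px = Counter(bx)              # counts for the marginal of X
    py = Counter(by)              # counts for the marginal of Y

    mi = 0.0
    for (i, j), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi

# Hypothetical example data: y depends strongly on x1 and weakly on x2.
rng = random.Random(0)
x1 = [rng.uniform(-1, 1) for _ in range(5000)]
x2 = [rng.uniform(-1, 1) for _ in range(5000)]
y = [a ** 2 + 0.1 * b for a, b in zip(x1, x2)]

mi_x1 = binned_mutual_information(x1, y)
mi_x2 = binned_mutual_information(x2, y)
print(f"I(X1;Y) ~ {mi_x1:.3f} bits, I(X2;Y) ~ {mi_x2:.3f} bits")
```

Note that nothing here assumes a parametric form for f(x), f(y), or f(x,y); all three are estimated from the same sample pairs. The estimate depends on the number of bins and is biased upward for small samples, so the scores are more useful for ranking inputs than as absolute values. Nearest-neighbor estimators, such as scikit-learn's `mutual_info_regression`, are an alternative that avoids choosing bins.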

And that's where I get confused. For one, I'm not sure what the p.d.f.s f(x) and f(y) are, and searching online hasn't helped me feel sure. My best guess for f(x) is a uniform distribution from -1 to 1, but f(y) completely loses me. At best I can imagine fitting a normal p.d.f. using the mean and standard deviation of the Y values.

I'm also stuck on the joint distribution. Any guidance would be very much appreciated!

Thank you,