Hello,

I'm using the ranksum function in MATLAB, which I believe is their implementation of the Mann-Whitney rank sum test (also known as the Wilcoxon rank sum test). I am using frequency distribution data. Say I had two datasets as below:

(a)
x 1 2 3 4 5
a 2 3 4 0 0
b 3 4 2 1 1

(b)
x 1 2 3 4 5
a 2 3 4 NaN NaN
b 3 4 2 1 1

where column x is an incremental value, a is the frequency distribution from sample 1 and b is the frequency distribution from sample 2. This is just junk data, by the way; I haven't run the test on it. The only difference between the two datasets is that in (a) the frequencies of a at x = 4 and x = 5 are entered as 0, whereas in (b) they are entered as NaN. Because it is a rank sum test, this changes the result, and consequently whether my actual data comes out as significant or not!

My question is: which is the better way to represent frequency distribution data in this instance? I suspect it is (b), because you are essentially distributing a fixed number of points across x, and when there are no points at 4 and 5 that should count as no data rather than as a value of 0. I just want to confirm whether this is correct. Incidentally, ranksum accepts two datasets of different lengths.
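To make the comparison concrete, here is a minimal sketch of the calls I mean, using the junk data above. The variable names are just ones I made up for this post. The last part is an assumption on my side: I am not sure whether the counts should instead be expanded back into raw observations (here with repelem) before calling ranksum, so I have included it only for comparison.

```matlab
% Junk data from the tables above (variable names are hypothetical)
x      = 1:5;
a_zero = [2 3 4 0 0];        % dataset (a): empty bins entered as 0
a_nan  = [2 3 4 NaN NaN];    % dataset (b): empty bins entered as NaN
b      = [3 4 2 1 1];

% What I am doing now: passing the counts themselves as the two samples.
p_zero = ranksum(a_zero, b);
p_nan  = ranksum(a_nan,  b);   % ranksum treats NaN as missing and ignores it

% Possible alternative (an assumption, not something I have confirmed):
% rebuild the raw observations from the counts, then test those.
a_raw = repelem(x, a_zero);    % [1 1 2 2 2 3 3 3 3]
b_raw = repelem(x, b);         % [1 1 1 2 2 2 2 3 3 4 5]
p_raw = ranksum(a_raw, b_raw);

fprintf('counts with 0:   p = %.4f\n', p_zero);
fprintf('counts with NaN: p = %.4f\n', p_nan);
fprintf('raw values:      p = %.4f\n', p_raw);
```

In the first two calls the frequencies themselves are what get ranked, so the two zeros in (a) enter the ranking as real values, while the NaNs in (b) simply shorten the sample. In the third call the values of x are what get ranked and an empty bin contributes nothing either way.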

Thanks in advance.