# How to estimate how different two sets are?

• Dec 22nd 2009, 04:14 PM
bagatur
Hello,

I have the following problem. I have a group of samples. For simplicity, let's say 10 samples, s1 through s10. I randomly split them into two sets of equal size (both sets put together make up the whole group of samples). For the sake of example and clarity, let's make the sets as follows:
set 1: s1, s4, s6, s9 and s10
set 2: s2, s3, s5, s7 and s8.
Now I do it all over again and create another selection of two equal-sized random sets, for example:
set 1': s1, s3, s4, s8 and s10
set 2': s2, s5, s6, s7, and s9.
My question is: how can I determine how different these two selections are? It might be easy to do visually when you have 10 samples, but I have about 50 samples and used 10 only to explain my problem clearly. Is there some kind of statistic I can calculate that would let me determine whether my selections 1, 2 and 1', 2' differ by more than just a couple of samples interchanged between the sets?
Thank you :)
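One way to make the question concrete is to count how many samples set 1 and set 1' have in common. The sketch below is only an illustration, not a method proposed in the thread: the function names (`overlap`, `swaps_apart`, `overlap_probability`) are hypothetical, and it assumes both splits divide the same n samples into two halves of equal size m, chosen uniformly at random. Under that assumption the overlap follows a hypergeometric distribution, so an observed overlap can be compared with what chance alone would give.

```python
from math import comb

def overlap(set_a, set_b):
    """Number of samples the two sets have in common."""
    return len(set(set_a) & set(set_b))

def swaps_apart(set_a, set_b):
    """Fewest pairwise swaps turning one split into the other.

    The labels "set 1" and "set 2" are treated as interchangeable,
    so an overlap of k counts the same as an overlap of m - k.
    """
    m = len(set(set_a))
    k = overlap(set_a, set_b)
    return min(m - k, k)

def overlap_probability(n, m, k):
    """P(overlap == k) when a second half of size m is drawn at random
    from n samples (hypergeometric distribution)."""
    return comb(m, k) * comb(n - m, m - k) / comb(n, m)

# The two selections from the post above.
set1 = {"s1", "s4", "s6", "s9", "s10"}
set1_prime = {"s1", "s3", "s4", "s8", "s10"}

print(overlap(set1, set1_prime))      # 3 (s1, s4 and s10)
print(swaps_apart(set1, set1_prime))  # 2
print(overlap_probability(10, 5, 3))  # 100/252, roughly 0.397
```

For the 10-sample example the two selections are 2 swaps apart, and an overlap of 3 is one of the most likely outcomes of a random re-split, so the selections are about as different as chance would make them. With 50 samples and halves of 25, the expected overlap under random splitting is 12.5; an overlap close to 25 or close to 0 would mean the two selections are nearly the same split.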
• Dec 23rd 2009, 04:36 PM
novice
Yes, statistics has a method for checking the difference. It is called the test of hypothesis for the difference of sample means. You decide how much difference you are willing to tolerate (1, 2, or 10) by setting the level of significance. If you can only tolerate a 1% difference, the level of significance is 1%. By convention, 1% is highly significant and 5% is probably significant.

You might want to read the chapter on Test of Hypotheses. I am sure you can find it in any statistics book.
• Dec 23rd 2009, 05:59 PM
bagatur
Novice, you completely missed the point of my question. I am not talking about the values of those samples; I am talking about how they are distributed into groups. Calculating sample means comes next.
And when I find out that set 1 and set 2 have significantly different sample means, and that set 1' and set 2' also have significantly different sample means, I want to know whether that is simply because the two set distributions are basically the same, with only a couple of samples different, which didn't affect the sample means much, or whether I have two completely different distributions that both produced significantly different sample means.
I hope that sheds more light on the problem that I have ;)
• Dec 26th 2009, 06:33 AM
novice
If you are talking about set distributions, then consider the test for the difference of sample proportions. The methods that I know are: (1) test of the difference of sample means, (2) test of the difference of standard deviation of means, (3) test of the difference of proportion of sample means.

If you are talking about sorting physical objects, as people do at a recycling center, you can sort out 50 samples much faster than by doing the math.