Hey aivoryuk.
What are you trying to look at specifically (i.e. the attributes)? What are you trying to answer? What is the directed goal of your experiment?
Hi
I am looking at 2 sets of data which have the same attributes but are very different in size: one group has 6000+ records and the other has about 160.
I am trying to look at the difference in the attributes between the 2 sets, so I have produced some descriptive statistics, run an unpaired t-test and built histograms for both sets.
Because of the big difference in size between the groups, I am unsure whether I can come to any valid conclusion.
Is there any advice anyone can give?
Regards
Alex
Hi
I have 2 data sets which have the same 24 attributes/columns etc. I would like to compare each attribute/column between the 2 data sets to see if there is a significant difference between the values.
Once I have ascertained that there is a difference for each attribute, I would like to know what that difference is. It may be, for one attribute, that the 1st data set has a higher average value than the other.
With the data sets being so different in size I am not sure if this would skew the interpretation.
What kind of difference though? What is the data type? Is it continuous or discrete data? Categorical data? Are you just comparing each attribute in each data set? What kind of test do you want to use? Have you checked whether your data and the context of the investigation support this?
If you are testing for differences then the typical techniques include 2-sample t-tests and their non-parametric equivalents. T-tests assume that all data points are independent, that the sample mean has a roughly normal distribution (large enough samples ensure this by the CLT), that the sample variance is roughly chi-square (same sort of argument), and that the sample mean and sample variance are independent of each other (they may not be).
If these are met, consider doing a t-test, and take note that t-tests allow for different sample sizes in each sample.
Also consider whether you want to use a pooled test, a paired test or an un-paired/un-pooled test. A pooled test assumes both groups have the same variance. If there is a relationship between the data sets, so that each observation in one is matched to an observation in the other, then consider using a paired t-test.
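As a rough illustration of the un-pooled (Welch) form of the 2-sample t-test, here is a minimal sketch using only the Python standard library. The function name `welch_t` and the toy samples are hypothetical and only there to show that unequal sample sizes are handled by the formula. It returns the t statistic and the approximate degrees of freedom; the p-value would still need to be looked up in a t table or a stats package.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Sample variances (n - 1 denominator), kept separate rather than pooled.
    va, vb = statistics.variance(a), statistics.variance(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical toy samples of unequal size, for illustration only.
group_a = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1]
group_b = [2.6, 2.9, 2.7]

t, df = welch_t(group_a, group_b)
print(f"t = {t:.3f}, df = {df:.2f}")
```

With 24 attributes, the same function could be called once per column, passing the two groups' values for that column.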
Hi, thanks for your time.
Yes, the data is continuous and I am comparing each attribute from each set, so as there are 24 attributes I would be performing 24 t-tests (or whatever test is appropriate).
It would appear that t-tests are the way to go?
If you have reason to believe the data is Poisson, exponential, or one of those other distributions where the mean and variance are related, then you shouldn't use a t-test, so you may want to take a look at the histograms and see if there is any particular pattern there before you continue.
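One rough numeric check to go alongside the histogram is to compare the sample mean with the sample variance. For a Poisson distribution the variance equals the mean, and for an exponential distribution the standard deviation equals the mean, so ratios near 1 are a hint (not proof) of that kind of relationship. The sketch below assumes a single attribute is available as a plain list of non-negative numbers; the function name and the toy sample are hypothetical.

```python
import math
import statistics


def mean_variance_summary(sample):
    """Summarise how the spread of a sample relates to its mean."""
    mean = statistics.mean(sample)
    variance = statistics.variance(sample)  # sample variance (n - 1 denominator)
    return {
        "mean": mean,
        "variance": variance,
        "variance_to_mean": variance / mean,      # near 1 suggests Poisson-like
        "sd_to_mean": math.sqrt(variance) / mean,  # near 1 suggests exponential-like
    }


# Hypothetical toy sample, for illustration only.
toy_sample = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3]
summary = mean_variance_summary(toy_sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```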
Thanks, I have just completed the histograms for all the attributes.
Here is an example:
Set A
Bin     Frequency   Cumulative %
0       5543        83.71%
1       796         95.73%
2       224         99.11%
3       48          99.83%
4       7           99.94%
5       4           100.00%
6       0           100.00%
More    0           100.00%
Set B
Bin     Frequency   Cumulative %
0       125         77.16%
1       29          95.06%
2       8           100.00%
3       0           100.00%
4       0           100.00%
5       0           100.00%
6       0           100.00%
More    0           100.00%
The majority of the histograms have this pattern.
Where would this data fall with regard to the patterns you have described?
Thanks for your help so far.
One good indicator of distributions with mean tied to variance is the skewness of that distribution. If you can get a statistical measure of the kurtosis or skewness of the distribution then that would be a good indicator to report.
There is no certainty in this kind of thing when you are dealing with data, but it's always wise to make sure the data doesn't clearly violate the assumptions of a technique, since that would make its results useless.
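The skewness and kurtosis mentioned above can be estimated straight from the Set A and Set B frequency tables posted earlier. This is only a sketch under a loud assumption: each bin label is treated as the exact value of every observation in that bin. If the bins are really upper limits of ranges for continuous data, the results are approximate. The function and variable names are hypothetical.

```python
def moments_from_counts(counts):
    """Mean, variance, skewness and excess kurtosis from a {value: frequency} table."""
    n = sum(counts.values())
    mean = sum(x * f for x, f in counts.items()) / n
    # Central moments with an n denominator (population-style estimates).
    m2 = sum(f * (x - mean) ** 2 for x, f in counts.items()) / n
    m3 = sum(f * (x - mean) ** 3 for x, f in counts.items()) / n
    m4 = sum(f * (x - mean) ** 4 for x, f in counts.items()) / n
    return {
        "n": n,
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }


# Frequencies copied from the posted histograms; bin labels are ASSUMED
# to be the observed values themselves.
set_a = {0: 5543, 1: 796, 2: 224, 3: 48, 4: 7, 5: 4}
set_b = {0: 125, 1: 29, 2: 8}

stats_a = moments_from_counts(set_a)
stats_b = moments_from_counts(set_b)
for label, stats in (("Set A", stats_a), ("Set B", stats_b)):
    print(label, {k: round(v, 3) for k, v in stats.items()})
```

A clearly positive skewness for both sets would match the long right tail visible in the histograms, which is the pattern to be wary of before relying on a t-test.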