1. ## t-test

I am trying to calculate some t-test results for a rather large dataset. The dataset has been log transformed because it has a positive skew. The distributions I am trying to compare are independent and of different sizes, so I am using Welch's approximation. I get the equation below:

(xbar1-xbar2)/sqrt((s1/n1)+(s2/n2))

(5.74-4.32)/sqrt((0.50/21936158)+(0.16/65008))

where xbar is the mean of the sample, s is the variance (the square of the standard deviation), and n is the number of samples.
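As a cross-check on the arithmetic, here is a minimal Python sketch of Welch's statistic and the Welch-Satterthwaite degrees of freedom, computed from summary statistics only. The function name `welch_t` is made up for illustration; the inputs are the figures quoted above.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    se1 = var1 / n1  # squared standard error of the first mean
    se2 = var2 / n2  # squared standard error of the second mean
    # Both variance terms sit inside the square root.
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Figures quoted in the post (log-transformed data).
t, df = welch_t(5.74, 0.50, 21936158, 4.32, 0.16, 65008)
print(f"t = {t:.1f}, df = {df:.0f}")
```

With these inputs the statistic comes out large and positive (roughly 900), because the standard error shrinks as n grows. A value that size reflects the sample sizes at least as much as the size of the difference.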

The variance of the log-transformed data is obviously quite small, but the number of samples is large. I am therefore getting a very large negative number for the t-test result. This can't be right, can it? I think I must be misinterpreting one of the parameters, but I'm not sure which. When I try to calculate the degrees of freedom I get a ridiculous number too.

Thanks for any assistance

2. Originally Posted by newbie2008
I am trying to calculate some t-test results for a rather large dataset. The dataset has been log transformed because it has a positive skew. The distributions I am trying to compare are independent and of different sizes, so I am using Welch's approximation. I get the equation below:

(xbar1-xbar2)/sqrt((s1/n1)+(s2/n2))

(5.74-4.32)/sqrt((0.50/21936158)+(0.16/65008))

where xbar is the mean of the sample, s is the variance (the square of the standard deviation), and n is the number of samples.

The variance of the log-transformed data is obviously quite small, but the number of samples is large. I am therefore getting a very large negative number for the t-test result. This can't be right, can it? I think I must be misinterpreting one of the parameters, but I'm not sure which. When I try to calculate the degrees of freedom I get a ridiculous number too.

Thanks for any assistance
How large are the samples and how large the skew?

CB

3. ## t-test

I should add some background. I am looking at the distribution of a species throughout Europe. I have an area where the species is present, and an area where it is not. The two areas are independent, as the data is only presence/absence. Over each area I also have a precipitation distribution. I want to compare the precipitation distribution for the area where the species is present with that for the area where it is not.

So I assume that in this case I have two populations. The first is where the species is present, represented by a number of pixels, which is 65088. The second is where the species is not present, which is 15641137 pixels. I have a precipitation distribution for each. The first is positively skewed (1.23). I haven't measured the skewness of the second yet, but it is also positively skewed.

I want to state whether the two precipitation distributions are significantly different. So I have log transformed them, and I was going to compare the means and the kurtosis. I thought that a t-test would also help me assess the significance of the difference between them. Is this fair?

I have tried the t-test on the population, and I can't seem to get a sensible answer. Do I instead only need to take a representative sample? If so, how would I calculate the sample size I need?

4. One more point to add: if I am taking samples, is it better to have them equal in size, even though the populations and variances differ, or to make each one proportional to the size of its population?

Thanks

5. Originally Posted by newbie2008
I have tried the t-test on the population, and I can't seem to get a sensible answer. Do I instead only need to take a representative sample? If so, how would I calculate the sample size I need?
Yes, you do a t-test about the population using a representative sample. There are assumptions and conditions for doing such a test. You should make sure your data do not violate these.

If you are unsure of your distribution type, you may want to employ a distribution-free test like the Wilcoxon Signed-Rank Test.
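For two independent groups, the rank-based counterpart is the Mann-Whitney U (Wilcoxon rank-sum) test. Below is a minimal standard-library sketch of it, using the normal approximation without a tie correction to the variance. The function name `mann_whitney_u` and the two small data lists are made up for illustration and are not the precipitation data.

```python
import math
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U for sample x, z score, p-value). Ties get average ranks,
    but the variance is not tie-corrected, so treat p as approximate.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, z, p

# Hypothetical values, chosen so every x is smaller than every y.
group_a = [1.1, 2.3, 3.1, 4.8, 5.0]
group_b = [6.2, 7.4, 8.1, 9.9, 10.5]
u, z, p = mann_whitney_u(group_a, group_b)
print(u, round(z, 2), round(p, 4))
```

The normal approximation is only rough for samples this small; it is used here to keep the sketch short.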

You wouldn't do a t-test on the entire population because, if you have the entire population, you can compute the means, standard deviations, etc. directly.

The sample size required depends on the confidence level you require and on the margin of error you are willing to accept.
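As a rough illustration of that dependence, here is a minimal sketch of the textbook sample-size formula for estimating a mean, n = (z * sd / E)^2, with an optional finite population correction. The function name `sample_size_for_mean` and the margin of 0.01 log units are assumptions made for the example. The variance of 0.16 and the population of 65088 pixels are the figures quoted earlier in the thread.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sd, margin, confidence=0.99, population=None):
    """Sample size needed to estimate a mean to within +/- margin."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z * sd / margin) ** 2
    if population is not None:
        # Finite population correction.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

sd = math.sqrt(0.16)  # variance quoted for the smaller group (log scale)
n_infinite = sample_size_for_mean(sd, margin=0.01)
n_finite = sample_size_for_mean(sd, margin=0.01, population=65088)
print(n_infinite, n_finite)
```

Halving the margin roughly quadruples the required sample size, while raising the confidence level has a milder effect.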

There are many online calculators that can determine this. Here's one:

Sample Size Calculator (Creative Research Systems)

6. Ok I'm getting there, I just have a couple more questions...

a) I've had a look at the Wilcoxon test you suggested. As the groups are independent, would it be more appropriate to use the Mann-Whitney-Wilcoxon test?

b) I'm still having a little difficulty defining to myself what is the sample and what is the population! What I have is a distribution purporting to be the population of a species over Europe (65088 pixels). However, it is obviously not totally accurate due to the resolution of the data and measurement errors (the error has been estimated, so I can state this). Does this mean that it is actually still a sample of an unknown true population?

c) In terms of sample sizes, if I use a sample of 10,000 pixels, I can say that I am 99% confident that the sample will be representative of the population (i.e. that the mean precipitation of my sample will be the mean precipitation of the actual population) at a confidence interval of 1.29. Is this a percentage (i.e. 1.29% above and below the mean precipitation)?
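One way to read the 1.29 figure: it matches the standard margin of error for a proportion at the worst case p = 0.5, the formula that survey-style calculators commonly use. Whether the linked calculator works exactly this way is an assumption here. The sketch below reproduces the number under that assumption; the function name `proportion_margin` is made up for illustration.

```python
import math
from statistics import NormalDist

def proportion_margin(n, confidence=0.99, p=0.5):
    """Margin of error for an estimated proportion, ignoring population size."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

margin = proportion_margin(10_000)
print(f"{100 * margin:.2f} percentage points")
```

If that reading is right, the 1.29 is in percentage points on a yes/no proportion, not a percentage of the mean precipitation. A margin for the mean would need the standard deviation of the precipitation values instead.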

Just trying to get it clear in my head!