# Thread: Z score, t score, CLT? I don't remember.

1. ## Z score, t score, CLT? I don't remember.

I hope I've posted in the correct forum, because this doesn't look like the high school statistics I learned.

I've got two questions:

I have a set of data. The population size is 263000. The mean score is 7.578, the standard deviation is 1.274.

A subset of the data has a particular characteristic, and I want to determine whether it is a contributing factor in the results.
The sample (subset) size is ~60000 with a mean of 8.348 and a standard deviation of 1.175.

What is the probability, statistical significance etc., that the sample could have a mean so different from the population? In other words, is it a contributing factor in the results? I'm not certain what I'm looking for or how to calculate it. I did Z-scores at uni a long time ago, but a friend started talking about t scores and just lost me. Central Limit Theorem rings a bell somewhere, but I'm not sure what to do.

The second question is similar to the above question, but relating to proportions.
Within the population, 0.051 "failed"; within the sample, 0.012 "failed". What is the probability, statistical significance etc., that this could occur by random chance? In other words, does this characteristic seem to have a very strong influence on results?

2. Originally Posted by MichaelF
I hope I've posted in the correct forum, because this doesn't look like the high school statistics I learned.

I've got two questions:

I have a set of data. The population size is 263000. The mean score is 7.578, the standard deviation is 1.274.

A subset of the data has a particular characteristic, and I want to determine whether it is a contributing factor in the results.
The sample (subset) size is ~60000 with a mean of 8.348 and a standard deviation of 1.175.

What is the probability, statistical significance etc., that the sample could have a mean so different from the population? In other words, is it a contributing factor in the results? I'm not certain what I'm looking for or how to calculate it. I did Z-scores at uni a long time ago, but a friend started talking about t scores and just lost me. Central Limit Theorem rings a bell somewhere, but I'm not sure what to do.

The second question is similar to the above question, but relating to proportions.
Within the population, 0.051 "failed"; within the sample, 0.012 "failed". What is the probability, statistical significance etc., that this could occur by random chance? In other words, does this characteristic seem to have a very strong influence on results?
I would not be happy giving advice here, because we don't know how your sample was selected. For instance, if you had 10 samples and we are now looking at the most extreme of them, you will need a different answer from the one you would get if this were the only question being asked of the data.

You may well need more sophisticated tools than the elementary statistics you suggest in this thread; we have no way of knowing.

CB

3. Ok...a context for the question. The sample is a subset of the population. In this case, the population of all the year 9 students in a particular state and their results.

Students have different characteristics: some tall, short, fat, thin etc. Without saying what the characteristic is, about 60000 students are known to have a particular characteristic shared by none of the other students in the population. So it isn't a random sample. I guess the question is: what is the probability that a sample like this could occur by random selection?

If the probability of this sample occurring by chance is 10%, then we might conclude that this characteristic is an influencing factor. If it has a probability of 0.0000000000005%, then we could argue that it is a determining factor.

I'm not asking anyone to solve it (although it would be great), but I'm after some pointers in the right direction. What things should I consider? With the second question on proportion I tried:

(sample proportion - population proportion) / sqrt[ {population proportion *( 1- population proportion)} / sample size]

$\frac{\hat{p} - \pi}{\sqrt{\frac{\pi (1 - \pi)}{n}}} = ?$

but that got me -43.5 standard deviations, which doesn't sound right. I asked a friend who said they knew a bit about stats, and they said it was impossible.
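
For anyone wanting to check the arithmetic, the formula above can be sketched in Python. The function name `one_sample_proportion_z` is made up for illustration, and the sample size is taken as exactly 60000 even though the post only says "~60000". With those assumptions the result comes out at roughly -43, so the figure quoted above appears to follow from the formula as written.

```python
import math

def one_sample_proportion_z(p_hat, p0, n):
    """z statistic for a sample proportion p_hat against a reference proportion p0.

    Hypothetical helper: it just evaluates (p_hat - p0) / sqrt(p0 * (1 - p0) / n).
    """
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    return (p_hat - p0) / standard_error

# Figures from the post; n = 60000 is an assumption (the post says "~60000").
z = one_sample_proportion_z(p_hat=0.012, p0=0.051, n=60000)
print(round(z, 1))
```

A value this far from zero is not impossible as a calculation; whether it means anything depends on how the subset was chosen, which is the point raised earlier in the thread.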

I need to learn LaTeX.

4. Here's a review....

IF the data comes from a NORMAL population, then...

${\bar X-\mu\over \sigma/\sqrt{n}}$ is a standard normal random variable.

IF the data comes from a NORMAL population and you do not know the pop variance, then...

${\bar X-\mu\over s/\sqrt{n}}$ is a t random variable with n-1 degrees of freedom.

NOW, if you don't have normality, you can use the central limit theorem IF n is large; then...
either of the two statistics above is APPROXIMATELY standard normal.
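
As a rough illustration, the two statistics above can be evaluated with the figures from the first post. This is only a sketch: the helper name `standardized_mean` is hypothetical, n is assumed to be exactly 60000, and the subset is treated as if it were a simple random sample, which the thread itself questions.

```python
import math

def standardized_mean(sample_mean, pop_mean, sd, n):
    """(sample_mean - pop_mean) / (sd / sqrt(n)); hypothetical helper name."""
    return (sample_mean - pop_mean) / (sd / math.sqrt(n))

# Using the population standard deviation (sigma = 1.274): the "z" form.
z_stat = standardized_mean(8.348, 7.578, 1.274, 60000)

# Using the sample standard deviation (s = 1.175): the "t" form.
t_stat = standardized_mean(8.348, 7.578, 1.175, 60000)

print(round(z_stat, 1), round(t_stat, 1))
```

With n this large the t distribution is practically indistinguishable from the standard normal, so the choice between the two forms matters far less here than the question of how the subset was selected.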

So, if you are sampling from a binomial distribution, then

${\hat p-p\over \sqrt{pq/n}}$, with $q=1-p$, is approximately a standard normal random variable.

HOWEVER, if p is near 0 or 1 the normal approximation can break down, and in that case the limiting distribution is a Poisson.
This seems to be the problem, since your p is near zero.
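
Whether a small p actually causes trouble depends on n as well. One widely used rule of thumb, brought in here as an assumption and not something stated in the thread, is that the normal approximation is reasonable when both np and n(1-p) are at least about 10. A quick check with the thread's figures (n assumed to be exactly 60000; the names below are hypothetical):

```python
def normal_approx_counts(n, p):
    """Expected numbers of failures and non-failures: n*p and n*(1-p)."""
    return n * p, n * (1.0 - p)

# Assumed threshold from the common rule of thumb; other texts use 5 or 15.
RULE_OF_THUMB_MIN = 10

expected_fail, expected_pass = normal_approx_counts(n=60000, p=0.051)
normal_ok = min(expected_fail, expected_pass) >= RULE_OF_THUMB_MIN
print(expected_fail, expected_pass, normal_ok)
```

By that rule of thumb the expected counts here are in the thousands, so the Poisson limit may not be the real obstacle in this particular case.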

5. Originally Posted by MichaelF
Ok...a context for the question. The sample is a subset of the population. In this case, the population of all the year 9 students in a particular state and their results.

Students have different characteristics: some tall, short, fat, thin etc. Without saying what the characteristic is, about 60000 students are known to have a particular characteristic shared by none of the other students in the population. So it isn't a random sample. I guess the question is: what is the probability that a sample like this could occur by random selection?
But if you selected this characteristic after examining the data for something that looks extreme, all bets are off as far as analysis is concerned. You will need to consider all possible selection criteria before you can do anything sensible.

Consider a population with mean $m$ and SD $s$, and 10 subpopulations with means $m_1, \dots, m_{10}$ and SDs $s_1, \dots, s_{10}$. If we now pick out the subpopulation with the largest mean and ask for the probability that this occurred by chance, we need to consider all the other subpopulations and not just the one with the largest mean.
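
To put a number on why this matters, here is a minimal sketch under a strong simplifying assumption that is not in the thread: the 10 subpopulation z-scores are treated as independent standard normal values when nothing is going on. The function name `prob_max_exceeds` is hypothetical.

```python
from statistics import NormalDist

def prob_max_exceeds(threshold_z, k):
    """P(largest of k independent standard normal z-scores > threshold_z)."""
    return 1.0 - NormalDist().cdf(threshold_z) ** k

# One pre-chosen group: about a 5% chance of exceeding z = 1.645 by luck alone.
single = prob_max_exceeds(1.645, 1)

# The largest of 10 groups: the chance is roughly 40%, not 5%.
largest_of_ten = prob_max_exceeds(1.645, 10)

print(round(single, 3), round(largest_of_ten, 3))
```

Subgroups of one finite population are not truly independent, so these numbers are only indicative, but they show how picking the most extreme group after the fact inflates the apparent significance.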

CB

6. I was thinking about what you guys said over the last day or so. Thanks btw.
I don't think this is a Poisson distribution, because it isn't a rare event, at least from my perspective. There are 5 groups of students, one of which has this particular trait. I'll admit that I don't know enough to determine whether this is a Poisson distribution, or what its significance would be if it is.

This group does have the largest mean, but the characteristic of this group is common, it's not unusual at all. The other groups (sub groups) have, in this regard, a similar but different characteristic.

CB, if I were to consider the other groups (I can do the calculations for them too), how would that help me understand the significance of this trait?

Thanks,
MichaelF