# Math Help - sample size and accuracy of generalizations

1. ## sample size and accuracy of generalizations

Hi, all.

I'd like to ask you folks about sampling, and the relationship between sample size and the accuracy of a generalization about a population. What I most want to know is: under what circumstances does the accuracy of a generalization depend on the ratio of the sample size to the total population size, rather than on the sample size alone?

For example...

Suppose 1,000 squirrels populate a fenced-in park, and we want to determine what percentage of that population has been infected with Yersinia pestis. To estimate this, we have available a random sample of $n_1$ squirrels.

Now suppose that 10,000 squirrels populate a second fenced-in park, and we also want to determine what percentage of that population is infected. For this population, we have available a random sample of $n_2$ squirrels.

If $n_1=n_2$, what can we say about the comparative accuracy of the two estimates? And how is accuracy even measured here?

Also, how large would $n_2$ have to be, with respect to $n_1$, in order for the accuracy of the two estimates to be about the same?

I hope I've been clear enough about what I'm asking, because to be honest I'm unsure how to articulate it. Anyway, your help would be much appreciated!

Thanks!

2. Originally Posted by hatsoff
First read this: Stats: Estimating the Proportion

Since $\sqrt{\frac{pq}{n}}$ varies little for small changes in p, substituting $\hat{p}$ for p and $\hat{q}$ for q introduces little error into the calculated standard error, and hence into the confidence interval.

The normal distribution is a good approximation to the binomial distribution when n is 'sufficiently large'. How large 'sufficiently large' is depends on p; the rule of thumb is np > 5 and n(1-p) > 5. So you can't know for sure whether n is sufficiently large, since you don't know p, but you can use $\hat{p}$ to get an idea of whether it was.

If n is small, I have some thoughts but no time at the moment. I'll continue later unless someone else does.
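To make the normal-approximation interval concrete, here is a rough Python sketch (the function name and the 12-out-of-100 squirrel counts are made up for illustration):

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a population proportion,
    using the normal approximation to the binomial distribution."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    # Rule of thumb: the normal approximation is reasonable only when
    # n*p_hat > 5 and n*q_hat > 5.
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("sample too small for the normal approximation")
    se = math.sqrt(p_hat * q_hat / n)  # estimated standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical example: 12 infected squirrels in a sample of 100.
low, high = proportion_ci(12, 100)
```

Notice that the population size never appears in this calculation, only the sample size n.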

3. Originally Posted by mr fantastic
In the small sample case (that is, when n is not large enough to use a normal approximation) finding a confidence interval is tedious.

I should also point out that when sampling from a small population (that is, when the sample size is more than about 10% of the population size) the standard error of the proportion involves the finite population correction factor $\sqrt{\frac{N-n}{N-1}}$, where N is the population size and n is the sample size.
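As a sketch of how that correction would be applied (the function name is my own, not standard):

```python
import math

def se_with_fpc(p_hat, n, N):
    """Standard error of a sample proportion when sampling without
    replacement from a finite population of size N, with the finite
    population correction factor applied."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    fpc = math.sqrt((N - n) / (N - 1))
    return se * fpc

# Sampling 200 of 1,000 squirrels (20% of the population) shrinks the
# standard error noticeably; sampling 200 of 1,000,000 barely changes it.
small_pop = se_with_fpc(0.5, 200, 1_000)
large_pop = se_with_fpc(0.5, 200, 1_000_000)
```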

4. I only understand about half of what was posted (I'm still wading through calc 3, and haven't made it to stats/probability), but from what I can gather, it seems that in this case the accuracy of a $\hat{p}$ estimate does not depend at all on the total population size, but on the sample size $n$ alone.

So, if $n_1=n_2$ and $\hat{p_1}=\hat{p_2}$, then the confidence for each $\hat{p}$ value is the same, no matter how different the two population sizes are... or do I have this completely wrong?

I mean, obviously the population sizes would place absolute limits on the error. For example, if you sample 10 members of a population of 100, and you get 5 positive results, then your error cannot be more than 45%, whereas if you had a 5/10 result from a pop. of 1000, your error could technically be as great as 49.5%. When working with larger numbers (in the millions, e.g.), does the ratio of n/pop. matter, or is it simply n that makes a difference, independent of the pop. size?
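That worst-case reasoning can be written out directly (a sketch; the 5-out-of-10 numbers come from the example above):

```python
def worst_case_error(positives, n, N):
    """Largest possible gap between the sample proportion and the true
    population proportion: the N - n unsampled members could all be
    positive, or could all be negative."""
    p_hat = positives / n
    p_min = positives / N              # none of the unsampled are positive
    p_max = (positives + N - n) / N    # all of the unsampled are positive
    return max(p_hat - p_min, p_max - p_hat)

# 5 positives out of 10 sampled:
# from a population of 100, the error cannot exceed 0.45;
# from a population of 1,000, it can be as large as 0.495.
```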

5. Originally Posted by hatsoff
There are different cases that all boil down to the size of the sample, the size of the sample compared to the size of the population, and the (unknown) value of p.

If the sample size n is 'large', but 'small' compared to the population size, and if p is not too extreme (close to 0 or 1) then the confidence interval depends only on n.
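A quick numerical check of that claim, using the two parks from the original question (same n and same $\hat{p}$, different N; the finite population correction supplies the only dependence on population size, and the 0.3 sample proportion is made up):

```python
import math

def margin_of_error(p_hat, n, N, z=1.96):
    """Half-width of the normal-approximation confidence interval for a
    proportion, with the finite population correction included."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    fpc = math.sqrt((N - n) / (N - 1))
    return z * se * fpc

# Same sample size and sample proportion in both parks:
park_1 = margin_of_error(0.3, 100, 1_000)    # n is 10% of the population
park_2 = margin_of_error(0.3, 100, 10_000)   # n is 1% of the population
# The two margins differ by under half a percentage point, even though
# one park holds ten times as many squirrels.
```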