(Please correct me if I'm wrong)
Here is my logic:
Let n be the sample size.
Let p' be the sample mean (which is different from the true mean p).
Let S = np' be the number of smokers in the sample.
Let e be the error rate (given as 0.5%, i.e. 0.005, in the question).
Concept of error
The error, if given as a percentage of the sample size n, such as 0.5% (= 0.005), tells us about the standard deviation in terms of n.
The standard deviation of the binomial distribution is $\sqrt{np(1-p)}$, which is a count of smokers rather than a fraction.
The absolute value of the error, as a count, is then 0.005n.
In this question, we are concerned with the error of the sample mean, expressed as a count: np' - np.
Hence, it is bounded by $|np' - np| < ne$.
And we can set up the inequality:
(np - ne) < S < (np + ne)
with which to start solving the question numerically.
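To get a feel for this inequality numerically, here is a minimal sketch that estimates the probability that S lands inside the range for a given n. It assumes the normal approximation to the binomial, and the function name `coverage` and the value p = 0.5 are my own placeholders (the question does not give p).

```python
from math import sqrt
from statistics import NormalDist

def coverage(n, p, e):
    """Approximate P(np - ne < S < np + ne) for S ~ Binomial(n, p)."""
    sd = sqrt(n * p * (1 - p))   # standard deviation of the count S
    z = n * e / sd               # half-width of the range, in standard deviations
    std_normal = NormalDist()    # mean 0, standard deviation 1
    return std_normal.cdf(z) - std_normal.cdf(-z)

# p = 0.5 is a hypothetical value, not taken from the question
for n in (100, 10_000, 40_000):
    print(n, round(coverage(n, 0.5, 0.005), 4))
```

The printed probability should climb toward 1 as n grows, which is the behaviour the rest of the argument relies on.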
Concept of Confidence Interval
The range of S which we defined earlier must cover 95% of the probability. The full range of S is huge, stretching from 0 smokers to n (= 100%) smokers, but the tail ends are not likely. Only the values centred around the mean (plus or minus about two standard deviations) matter. In fact, they make up 95% of all the probability, which is what gives the interval a 95% chance of containing the true mean: a 95% confidence interval.
Plotted on the standard normal curve, it's the area within the range -1.96 to 1.96. These values and this area are INVARIANT for a 95% confidence interval. In fact, we use it so often that we loosely call it "within 2 standard deviations".
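The -1.96 to 1.96 figure can be checked directly. This is a small sketch using Python's standard-library `statistics.NormalDist`; the variable names are mine.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area under the curve between -1.96 and 1.96
area = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Going the other way: the point that leaves 2.5% in the upper tail
z = std_normal.inv_cdf(0.975)

print(area, z)
```

The area should come out very close to 0.95 and z very close to 1.96, whatever n or p happen to be.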
How to calculate sample size
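We are given a fixed error rate (a ceiling we cannot exceed, though we can go below it) and a fixed confidence level, but we can vary n. We find a relationship between e and n by setting the absolute error ne equal to 1.96 standard deviations, $ne = 1.96\sqrt{np(1-p)}$, which gives $e = 1.96\sqrt{p(1-p)/n}$.
So as n --> ∞, e --> 0.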
(Can anyone help me out here? I'm trying to find an expression for e < n.)
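One possible way to finish, offered as a sketch rather than a definitive answer: if the normal approximation holds, so that e = 1.96 sqrt(p(1-p)/n), then solving for n gives n = 1.96^2 p(1-p) / e^2. The function name `required_sample_size` is hypothetical, and p = 0.5 is an assumption (the question does not give p); it is the worst case, since p(1-p) is largest there.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(e, p=0.5, confidence=0.95):
    """Smallest n whose margin of error is at most e (normal approximation)."""
    # z is about 1.96 for a 95% confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Rearranged from e = z * sqrt(p * (1 - p) / n)
    return ceil(z * z * p * (1 - p) / (e * e))

# e = 0.005 comes from the question; p = 0.5 is the assumed worst case
n_needed = required_sample_size(0.005)
print(n_needed)
```

With e = 0.005 this comes out at roughly 38,400 people. Any larger n keeps the error below 0.005, which matches the idea that e shrinks as n grows.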