Binomial error paradox

Oct 2012
751
212
Ireland
In a binomial distribution, if the chance of something happening is p and q = 1-p, then the variance for the event occurring is pq. The variance for the event not occurring is also qp. Since the error within a given confidence interval is related to the variance, this error is the same for both the event occurring and not occurring. In my particular case p = 0.1, and with a 95% confidence interval the value of p lies between 0.096 and 0.104, or 0.1 +- 0.004.
I am looking to find the relative error associated with something happening, which would be 0.004/0.1 = 4%.
I want to keep gathering data until the relative error is below 3%, but here is what confuses me.
The chance of the event not occurring is 0.9, with the same interval width as the chance of it happening (between 0.896 and 0.904, or 0.9 +- 0.004), and the relative error is 0.004/0.9 = 0.44%, which is below the 3% I am aiming for.
It does not make sense to be content with the error of the event not occurring but not be content with the error of it occurring.
How do people handle relative error in a binomial distribution? Perhaps take an average, or only look at the error in the confidence interval?

 

chiro

MHF Helper
Sep 2012
6,608
1,263
Australia
Hey Shakarri.

If you are trying to find the relative difference between the mean and the standard error then simply set up the inequality and solve for n.

If you are using a Wald test or a normal approximation then the standard error of the mean with a binomial is se = SQRT(p_hat*(1-p_hat)/n) where p_hat is the estimated value of the proportion which is just the mean of the sample data.

So you are looking at [1.96*se]/p_hat < t where t is your threshold (3% = 0.03) so extracting n we get:

(1.96)^2*(1-p_hat)/(p_hat*t^2) < n or
n > (1.96)^2*(1-p_hat)/(p_hat*t^2)

So you can find the first integer satisfying that condition and you have your sample size.

If you want to allow for p_hat fluctuating within a specific range, then do this calculation for the lower and upper bounds of that range and combine the two results to get a value for n (taking the larger of the two is the conservative choice).
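As a rough illustration of the inequality above, here is a minimal Python sketch. The function names `min_sample_size` and `min_sample_size_over_range` are made up for this example, and taking the larger n over the two bounds is one assumed way of "combining" them, not something the thread specifies.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def min_sample_size(p_hat, t, z=Z_95):
    """Smallest integer n satisfying n > z^2 * (1 - p_hat) / (p_hat * t^2)."""
    bound = z**2 * (1.0 - p_hat) / (p_hat * t**2)
    return math.floor(bound) + 1


def min_sample_size_over_range(p_low, p_high, t, z=Z_95):
    """Assumed combination rule: take the larger n required at either bound.

    The bound is decreasing in p_hat, so the lower end of the range drives n.
    """
    return max(min_sample_size(p_low, t, z), min_sample_size(p_high, t, z))


if __name__ == "__main__":
    # Threshold t = 0.03 and p_hat = 0.1 are the values used in the thread.
    print(min_sample_size(0.1, 0.03))
    # Bounds 0.096 and 0.104 are the interval quoted in the first post.
    print(min_sample_size_over_range(0.096, 0.104, 0.03))
```

Because the bound scales with (1 - p_hat)/p_hat, a small p_hat needs far more data than its complement for the same relative threshold.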
 
Oct 2012
751
212
Ireland
I am afraid you have misunderstood my question, but thanks for trying.
I am using a normal approximation, and the standard error multiplied by 1.96 is 0.004.

I am using the formula [1.96*se]/p_hat = t as in your response, but to find t for the current sample size n.
The problem is that when applying this equation to the chance of the event occurring (p_hat = 0.1), t = 0.004/0.1 = 0.04, which I consider to be too high, so more data would need to be gathered.
BUT
Applying the same formula to the chance of the event not occurring (p_hat = 0.9), t = 0.004/0.9 = 0.0044, which is a low enough value of t that no more data would need to be gathered.
t in this case is almost 10 times lower than t for the chance of the event occurring.

So on one hand I am not sure if 0.1 is accurate for the chance of something happening, but on the other hand I am sure that 0.9 is accurate for the chance of it not happening. I hope you can see why this doesn't make sense.
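The two values of t can be reproduced with a short Python sketch. The sample size is never stated in the thread, so `N_ASSUMED` below is back-calculated from the quoted margin of 0.004 at p_hat = 0.1 and should be read as an assumption; `relative_margin` is a hypothetical helper name.

```python
import math


def relative_margin(p_hat, n, z=1.96):
    """Margin of error (z * standard error) divided by the estimate itself."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return z * se / p_hat


# Assumption: n is not given in the thread. This value is back-calculated
# from 1.96 * sqrt(0.1 * 0.9 / n) = 0.004.
N_ASSUMED = 21609

t_occurs = relative_margin(0.1, N_ASSUMED)  # relative to p_hat = 0.1
t_not = relative_margin(0.9, N_ASSUMED)     # relative to 1 - p_hat = 0.9

print(f"t for the event occurring:     {t_occurs:.4f}")
print(f"t for the event not occurring: {t_not:.4f}")
print(f"ratio: {t_occurs / t_not:.2f}")
```

The absolute margin is identical in both cases; only the number it is divided by changes.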
 

chiro

Well in terms of entropy, the maximum value is at p = 1/2, but what you could do is again use upper and lower bounds and combine the information to get an estimate.

You can make the bounds whatever you like and you can even make them dependent on an existing sample and each new observation you obtain.

The reason I mention entropy is that the point of highest entropy is the point of highest uncertainty. You also need a relatively smaller sample at points of low entropy, which is why the location of the highest (and the lowest) entropy is something to really consider when you want to do these kinds of calculations.
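To make the entropy remark concrete, here is a small Python sketch comparing the entropy of a single yes/no trial with its variance p(1-p). The helper names are illustrative only. Both quantities peak at p = 1/2 and are unchanged when p and 1-p are swapped.

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of one trial with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))


def bernoulli_variance(p):
    """Variance of a single trial, the pq of the original post."""
    return p * (1.0 - p)


for p in (0.1, 0.5, 0.9):
    print(f"p={p}: entropy={binary_entropy(p):.4f} bits, "
          f"variance={bernoulli_variance(p):.4f}")
```

Since both measures give the same value for p = 0.1 and p = 0.9, they describe the absolute uncertainty of the trial rather than an error relative to either p or q.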

Apart from either using a guess based on a prior distribution, or updating your guess with each new observation that is added to your sample, I can't really suggest anything else.
 
Oct 2012
751
212
Ireland
This is not about the upper and lower bounds; it is about the reference point used when looking at the relative error. It isn't clear what the relative error should be relative to.

Forget about the numbers. Suppose the chance of something happening is p, the chance of it not happening is q, and I am taking a 95% confidence interval.
The margin of error, i.e. the difference between the upper bound and the mean, is 1.96*(pq/n)^(1/2)
This error relative to p is [1.96*(pq/n)^(1/2)]/p
The error relative to q is [1.96*(pq/n)^(1/2)]/(1-p)
Which relative error is correct? If p<0.5 then the first figure is higher than the second. But why should it be higher? Why would I be less certain about the chance of something happening than I am about the chance of it not happening? Since q is determined by p, I cannot be certain about one and uncertain about the other.
Ultimately my question is: how do people find a relative error that avoids this paradox of being more certain of q than they are of p?
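For reference, the two relative errors written out above simplify as follows. This is only a restatement of the formulas in this post, using the same p, q = 1 - p and n.

```latex
\[
\frac{1.96\sqrt{pq/n}}{p} = 1.96\sqrt{\frac{q}{np}},
\qquad
\frac{1.96\sqrt{pq/n}}{q} = 1.96\sqrt{\frac{p}{nq}},
\qquad
\frac{1.96\sqrt{pq/n}\,/\,p}{1.96\sqrt{pq/n}\,/\,q} = \frac{q}{p}.
\]
```

The ratio of the two figures depends only on q/p and not on n, so for p = 0.1 the error relative to p is 9 times the error relative to q at every sample size.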
 

chiro

Hint: Read into entropy.