The variance of a Poisson distribution is equal to its mean rate, so if you have an estimate of the mean rate then you also have an estimate of the variance.
Hi!
If X is normally distributed, then the sample variance (suitably scaled) is chi-squared (X^2) distributed,
and if the number of samples is k>50 then it is approximately normally distributed.
Now, what if X is Poisson distributed (or binomial)?
What can be said about the distribution of the variance?
And how many samples k are enough for it to be approximately normally distributed?
Hope you can help!
Thanks!
Hi! Yes, that's basic, but unfortunately it is not what I meant. I'm trying to
get a confidence interval for the variance, given real-world estimates of the variance.
That is, loosely speaking, "the variance of the variance".
Since X is not normal but Poisson, I cannot use the X^2 distribution.
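For reference, the "variance of the variance" asked about here has a standard closed form. The block below is only a sketch of that textbook result for independent, identically distributed observations; the symbols n (sample size), s^2 (sample variance with the n-1 denominator), \mu_4 (fourth central moment) and \lambda (Poisson rate) are notation introduced for this note, not taken from the thread.

```latex
% Variance of the sample variance s^2 for n i.i.d. observations
\operatorname{Var}(s^2) = \frac{\mu_4}{n} - \frac{(n-3)\,\sigma^4}{n(n-1)}

% Poisson(\lambda): \sigma^2 = \lambda and \mu_4 = \lambda + 3\lambda^2, so
\operatorname{Var}(s^2) = \frac{\lambda}{n} + \frac{2\lambda^2}{n-1}

% Normal: \mu_4 = 3\sigma^4, which gives the familiar
\operatorname{Var}(s^2) = \frac{2\sigma^4}{n-1}

% For comparison, the sample mean of Poisson data has
\operatorname{Var}(\bar{x}) = \frac{\lambda}{n}
```

Under these assumptions the sample variance is always a noisier estimate of \lambda than the sample mean, by the extra term 2\lambda^2/(n-1). For large n a rough interval is s^2 \pm z \sqrt{\operatorname{Var}(s^2)} with s^2 plugged in for \lambda; how large n must be for the normal approximation to be acceptable depends on \lambda and is not settled by this formula.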
If you are looking for the variance of the sample variance so that you can get a confidence interval for the variance, then that isn't necessary: if you have a confidence interval for the mean then it should also be a confidence interval for the variance. You can find exact ways of computing the confidence interval of the mean for a Poisson distribution, instead of assuming that the mean has a normal distribution. Wikipedia has an exact formula: Poisson_distribution#Confidence_interval
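As an illustration of the "exact" interval mentioned above, here is a minimal standard-library sketch. It finds the bounds by bisection on the Poisson CDF rather than through chi-squared quantiles, which should give the same interval; the function names and the example count are made up for this note.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)          # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i            # P(X = i) from P(X = i - 1)
        total += term
    return min(total, 1.0)

def solve_decreasing(f, target, lo, hi, iters=200):
    """Bisection for f(lam) = target, where f decreases as lam grows."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid               # solution lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_ci(k, alpha=0.05):
    """Exact two-sided interval for the mean of a single Poisson count k."""
    bracket = k + 10 * math.sqrt(k + 1) + 20   # generous upper search limit
    # Lower bound solves P(X >= k) = alpha/2, i.e. P(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda lam: poisson_cdf(k - 1, lam), 1 - alpha / 2, 0.0, bracket)
    # Upper bound solves P(X <= k) = alpha/2.
    upper = solve_decreasing(
        lambda lam: poisson_cdf(k, lam), alpha / 2, 0.0, bracket)
    return lower, upper

low, high = exact_poisson_ci(10)   # hypothetical total of 10 events
print(low, high)
```

If the 10 events were counted over n timeslots, the total is Poisson with mean n times the per-slot rate, so dividing both bounds by n gives an interval for the rate per slot. The term-by-term CDF is fine for totals up to a few hundred; very large counts would underflow `math.exp(-lam)`.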
Thanks for your reply Shakarri! Interesting: "confidence interval for the mean then it should also be a confidence interval for the variance". Is that true? Do you have a link on that as well? I feel that if you ignore the mean (because it is biased, e.g. due to a non-calibrated instrument) and try to estimate the mean through the variance, then the confidence interval would be wider. In practice the factor between mu and the variance for a Poisson is seldom exactly 1, often larger, just as an example of the uncertainty. I feel that a c.i. for the mean made by measuring the mean directly would be narrower than a c.i. for the mean made by measuring the variance. Hoping for some input!
I'd like to add something to this. Since I'm interested in estimating the mean by measuring the variance, I would also like a comment on this article.
http://data.princeton.edu/wws509/notes/c4addendum.pdf
On page 2: "Several authors have proposed estimating "the constant btw mean and variance" using Pearson’s chi-squared statistic divided by its degrees of freedom"
I know this is the Chi2 distribution but I don't know how to apply it in a real-world example. I would like a short example to grasp it. Say I have 6 values: what "p" should I use? What is n in my case, and X... everything there, really.
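One possible reading of the quoted sentence, offered here as an assumption rather than as what the linked notes necessarily intend, is that p counts the fitted parameters of the model and n the number of observations. For a model with a single constant rate that means p = 1, and with 6 values n = 6. The six counts below are invented purely for illustration.

```python
from statistics import mean, variance

# Hypothetical data: one count per timeslot, 6 timeslots.
counts = [3, 1, 4, 2, 6, 2]

def pearson_dispersion(y, n_params=1):
    """Pearson chi-squared divided by (n - p) for a constant-rate model.

    Assumes the only fitted parameter is the common mean (p = 1).
    """
    n = len(y)
    mu_hat = mean(y)                                   # fitted mean for every slot
    chi2 = sum((yi - mu_hat) ** 2 / mu_hat for yi in y)
    return chi2 / (n - n_params)

phi_hat = pearson_dispersion(counts)
print(phi_hat)
print(variance(counts) / mean(counts))   # same number: sample variance / sample mean
```

Under this reading the "constant between mean and variance" is simply the sample variance divided by the sample mean, so no chi-squared table is needed to compute it. It does rely on the sample mean being trustworthy, which is exactly what is in doubt when an unknown offset has been added to the signal.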
The confidence intervals of the variance and of the mean would not be the same. I expect the c.i. for the variance would be larger. I do not have a reference for applying the c.i. of the mean as a c.i. for the variance, but there is a simpler example of this being used.
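For a normal distribution the c.i. of the mean \mu is
  \bar{x} - z \sigma/\sqrt{n} \le \mu \le \bar{x} + z \sigma/\sqrt{n}    (1)
And the (large-sample) c.i. of the median m, based on the sample median \hat{m}, is
  \hat{m} - z \sqrt{\pi/2} \sigma/\sqrt{n} \le m \le \hat{m} + z \sqrt{\pi/2} \sigma/\sqrt{n}    (2)
Since \mu = m for a normal distribution, we can turn the first expression into
  \bar{x} - z \sigma/\sqrt{n} \le m \le \bar{x} + z \sigma/\sqrt{n}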
Being a probability distribution doesn't stop us from making substitutions like this. You will notice that the first equation has a smaller uncertainty range than the second because (pi/2)>1 so it would make sense to use the mean to estimate the median. But there are some advantages to using the median, if you sample one extremely large value it will change the mean quite a lot but it will not affect the median much so this gives the median an advantage. The median is "robust" against extremely large or small samples. If you use equation (1) to estimate the median then you lose this ability to be robust.
So I hope that clears up why a c.i. for the variance would be suitable for the mean. Unlike the median being robust I don't think the variance has any advantage over the mean except as you said in your case the mean is biased.
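The pi/2 factor in the mean-versus-median comparison can be checked with a quick simulation. This is only a sketch with an arbitrary sample size, repetition count and seed; it compares the spread of the sample median with the spread of the sample mean for standard normal data.

```python
import math
import random
from statistics import median, stdev

random.seed(1)
n, reps = 101, 4000          # arbitrary choices for the illustration

means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(sum(sample) / n)
    medians.append(median(sample))

# Ratio of the two standard errors; large-sample theory gives sqrt(pi/2).
ratio = stdev(medians) / stdev(means)
print(ratio, math.sqrt(math.pi / 2))
```

The ratio should come out near 1.25, meaning the median-based interval is about 25% wider than the mean-based one for normal data.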
You are using a Poisson distribution to model the number of events that occur in a time interval. The number of events that occur will be one of {0, 1, 2, ..., M}, i.e. any whole number between 0 and M. The value of p you are looking for is the number of "groups" you have, which is equal to M+1.
You need to decide what you think the probability of each of these is, not based on data but based on some theory of how you think the system works. So if they are all equally likely then the probability for each of them would be 1/(M+1), or if this is related to the toilet flushing thing you might be able to use the formula I derived to find your theoretical probabilities.
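Once you have the probability E_{i} for each of i={0,1,2,...,M}, use your data to find the observed proportion O_{i} of each of {0, ..., M} happening.
You can then calculate the figure
  X^2 = n \sum_{i=0}^{M} (O_{i} - E_{i})^2 / E_{i}
(the factor n is needed because O_{i} and E_{i} are proportions; if you work with counts instead, it drops out).
And the last figure you need, n, is equal to your sample size.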
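The recipe above can be sketched in a few lines. The data, the choice M = 3 and the "all equally likely" expected probabilities are hypothetical; the function name is made up for this note.

```python
from collections import Counter

def pearson_statistic(samples, expected_probs):
    """Pearson X^2 from observed proportions O_i and expected probabilities E_i.

    expected_probs[i] is the theoretical probability of observing the value i.
    Values outside 0..M are ignored in this sketch.
    """
    n = len(samples)
    observed = Counter(samples)
    total = 0.0
    for value, e_prob in enumerate(expected_probs):
        o_prob = observed.get(value, 0) / n      # observed proportion O_i
        total += (o_prob - e_prob) ** 2 / e_prob
    return n * total                              # factor n: proportions, not counts

M = 3
samples = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1]    # hypothetical, n = 12
uniform = [1 / (M + 1)] * (M + 1)                 # "all equally likely"

x2 = pearson_statistic(samples, uniform)
print(x2)
```

With counts instead of proportions the same number comes from summing (observed count - expected count)^2 / expected count over the groups.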
Hi thanks for the reply!
"So I hope that clears up why a c.i. for the variance would be suitable for the mean. Unlike the median being robust I don't think the variance has any advantage over the mean except as you said in your case the mean is biased."
I can't find the line where you talk about a c.i. for the variance obtained by measuring the variance?
Yes, the mean is biased, and that's very important. Maybe I wasn't clear about the "why" of this. This is a real-world problem I'm trying to solve. I'm interested in estimating the mean from the variance because the signal is severely biased, that is, an unknown constant value is added to the signal. The variable should follow a Poisson or binomial distribution, whichever one prefers.
Okay, say I have 10 persons who flush during 6 timeslots (10 min each) in one hour. Then p would be 11. So I should use Chi2 11, right? What value should I use for Chi2 11? Then it's said I should divide by n-p = n-11; what is n? "Pearson’s chi-squared statistic divided by its degrees of freedom": if the d.f. is 11 then n should be 22. I don't get this. Hope you can help! Again, what is n, and what value do I use for Chi2 11?
Edit: I believe I misinterpreted the d.f. in the denominator as being the same as for the Chi2; I have another view of it in the next post.
Thanks for your input! Can you tell me if my reasoning about how this relates to the Pearson chi2 is right?
If I have 6 timeslots and 11 possible "groups" then I get a negative value: I have more parameters than "equations", therefore it cannot be solved. Is that correct? I need more slots.
1) How many would be preferable, enough for a decent estimate? If I get 60 timeslots (10 persons) then I should divide Chi2 11 by n-p = 60-11 = 49.
1b) What if I have a lot of zeroes? (As I could and would have in some cases for a binomial.) Do I really have proportionally more information about my distribution, as the division by n-p would suggest? Or does the Chi2 11 calculation fully compensate for this? Say I have 10 people flushing during 60 min. At least 50 of these slots would be zeroes.
2) If I look up Chi2 11 in a table, how do I know what value to use?
http://sites.stat.psu.edu/~mga/401/t...uare-table.pdf
I guess for now I don't care about the associated probability and just take the value X^2 from the sum you showed, divide it by 49, and then I'm done with Pearson chi2?
3) How would I use this Pearson value? If Pearson were e.g. 1.2, should I divide my estimate of the mean (obtained from the variance) by 1.2 to get a "truer" value of the mean?
4) Let's say I have 60 timeslots each night and 30 nights in a month. Should I compute this Pearson value for each night (it will be somewhat different every night) and correct each variance-based estimate of the mean with it? That would be 30 Pearson calculations over a month.
Thanks again for your input Shakarri!
The example of using a c.i. for the mean to get a c.i. for the median was meant to show that you can use a c.i. of the variance as a c.i. for the mean.
"I'm interested in estimating the mean from the variance because the signal is severely biased, that is, an unknown constant value is added to the signal"
I really like this application. Before this I never saw a need to have a c.i. for the variance, but it is really smart to use the c.i. of the variance in your case, since an added constant would not affect the variance.
"if I have 6 timeslots and 11 possible "groups" then I get negative value,"
n is not the number of timeslots; n is the number of samples you use when estimating the variance. When you count how many flushes are in a 10 minute time interval several times and calculate the sample variance, you are using a sample size of n to do that.
"I guess for now, I don't care about the associated probability and just take the value X^2 from the sum you showed and divide it by 49 and then I'm done with Pearson chi2 ?"
Yes, you are correct, the probability doesn't matter for this calculation.
There is a bit of a problem: if the severe bias you spoke of increases the number of flushes by a constant amount, then you cannot calculate the Pearson statistic. If this is the case, you could either assume that there is no over-dispersion, so that the constant between mean and variance is 1 and variance=mean, or, if you took enough samples, you could find what the added constant is: if you never saw fewer than 4 flushes then the constant might be 4 (and you can calculate the chance that it is less than 4).
I think the rest of your questions all come from the mix-up about what n is.
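The idea of recovering the added constant from the data can be sketched as follows. It assumes the model "observed count = constant + Poisson count" with no over-dispersion; the counts and function names are hypothetical.

```python
import math
from statistics import mean, variance

def offset_estimates(y):
    """Two rough guesses for an added constant c when y = c + Poisson(lam)."""
    c_min = min(y)                    # smallest count ever seen
    c_mom = mean(y) - variance(y)     # mean = lam + c and variance = lam
    return c_min, c_mom

def prob_no_zero(lam, n):
    """Chance that n independent Poisson(lam) counts contain no zero at all,
    i.e. that the smallest observed count overstates the constant."""
    return (1 - math.exp(-lam)) ** n

y = [5, 4, 7, 6, 4, 5, 8, 5]          # hypothetical biased counts
c_min, c_mom = offset_estimates(y)
print(c_min, c_mom)
print(prob_no_zero(variance(y), len(y)))
```

The minimum can only overstate the constant, never understate it, and the chance of overstating shrinks as more timeslots are observed. The mean-minus-variance guess inherits all the noise of the sample variance, so it is a rough cross-check at best.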
Why would that be proof of anything about the c.i. of the variance? If you think about it, mean and median are very closely related; indeed, if the population is not skewed the results are interchangeable. But the variance is computed from squared numbers, so I feel the generalisation would be very bold. If the solution is that easy, why don't they say so in the article? Simply "the c.i. of the variance is always the same as the c.i. for the mean in a Poisson". Could you help me find a credible article saying that?
One "proof" that this could be wrong:
If we compare for a moment with the c.i. for the normal distribution: I guess you know that the c.i. of the mean is t-distributed, that is, symmetric and normal-like, becoming more or less normal if the number of samples is 30 or more. Now, the c.i. of the variance is, as you know, X^2 and not symmetric at all! But it can be said to be normal-like, becoming more or less normal if the number of samples is 50 or more. (Compare the numbers 30 and 50; I believe this is an indication that the c.i. of the variance is wider.)
Now for the Poisson: if λ=np is greater than about 10, then the normal distribution is a good approximation. So for np>10 the c.i. for the mean and the variance are not the same.
Furthermore, let np=8 to show some true "Poissonic" behaviour; then the c.i. from the X^2 would not hold true, but the basic shape of the X^2 would be intact.
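The "λ greater than about 10" rule of thumb can be looked at numerically. The sketch below measures the largest gap between the Poisson CDF and a normal CDF with the same mean and variance, using a continuity correction; it says something about the counts themselves only, not about the distribution of their sample variance. The function name and the λ values tried are choices made for this note.

```python
import math
from statistics import NormalDist

def max_cdf_error(lam):
    """Largest |Poisson CDF - normal CDF| over the integers, with the normal
    matched in mean and variance and evaluated at k + 0.5."""
    normal = NormalDist(mu=lam, sigma=math.sqrt(lam))
    k_max = int(lam + 10 * math.sqrt(lam) + 10)   # far into the upper tail
    term = math.exp(-lam)                          # P(X = 0)
    cdf, worst = 0.0, 0.0
    for k in range(k_max + 1):
        cdf += term
        worst = max(worst, abs(cdf - normal.cdf(k + 0.5)))
        term *= lam / (k + 1)                      # next Poisson probability
    return worst

for lam in (2.0, 8.0, 10.0, 50.0):
    print(lam, max_cdf_error(lam))
```

The error shrinks only slowly as λ grows, so whether λ = 8 or λ = 10 is "good enough" depends on the accuracy required rather than on a sharp threshold.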
Back to my original question
------------------------------------------
Hi!
If X is normally distributed, then the sample variance (suitably scaled) is chi-squared (X^2) distributed,
and if the number of samples is k>50 then it is approximately normally distributed.
Now, what if X is Poisson distributed (or binomial)?
What can be said about the distribution of the variance?
And how many samples k are enough for it to be approximately normally distributed?
---------------------------
The "objections" I had before still hold true (in my world):
"One piece of "evidence" for that is that you first have to calculate the mean to get the variance. That generally reduces the d.f., hence you have fewer values for calculating the variance than for calculating the mean. (When you calculate the mean you don't have to calculate the variance first.)"
The variance could in some sense be thought of as noise (though true measurements), and it's hard to believe that measuring the noise should give more or less exactly the same certainty in the results as measuring the mean directly.
"I really like this application, before this I never saw a need to have a c.i. for the variance but it is really smart to use the c.i. of the variance in your case since an added constant would not affect the variance "
Thanks, I thought I was on to something here. I appreciate the help!
The very reason I want to measure the mean through the variance is that the signal is biased.
Otherwise I would measure the mean directly, which would be much more straightforward.
And I want to know how sure I can be of my variance-based estimate of the mean, thus the c.i. of the variance.
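A minimal sketch of what such a variance-based estimate with an approximate interval could look like is given below. It assumes plain Poisson counts plus an unknown constant, no over-dispersion, and a large-sample normal approximation using Var(s^2) = λ/n + 2λ^2/(n-1) for Poisson data. The simulated rate, offset, sample size and all function names are invented for the illustration.

```python
import math
import random
from statistics import NormalDist, variance

def poisson_sample(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def lambda_ci_from_variance(y, level=0.95):
    """Estimate lam by the sample variance, with a rough normal-theory interval.

    Uses Var(s^2) = lam/n + 2*lam^2/(n-1), valid for Poisson data, with the
    estimate plugged in for lam.
    """
    n = len(y)
    lam_hat = variance(y)
    se = math.sqrt(lam_hat / n + 2 * lam_hat ** 2 / (n - 1))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return lam_hat, max(0.0, lam_hat - z * se), lam_hat + z * se

rng = random.Random(42)
true_lam, offset = 3.0, 7                      # hypothetical rate and bias
raw = [poisson_sample(true_lam, rng) for _ in range(600)]
biased = [x + offset for x in raw]             # what the instrument reports

print(lambda_ci_from_variance(raw))
print(lambda_ci_from_variance(biased))         # identical: the offset cancels
```

The two printed results agree because adding a constant leaves the sample variance unchanged. The interval is noticeably wider than one based on an unbiased sample mean would be, which is the price of working around the bias.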
----------------
"n is not the number of timeslots, n is the number of samples you use when estimating the variance. When you test how many flushes are in a 10 minute time interval several times and calculate the sample variance you are using a sample size of n to do that."
I prefer to look at it as:
n is always the number of timeslots, one sample per timeslot.
Sometimes I don't get to sample every minute in some areas.
In fact, the timeslots are directly given by the sample rate.
If I get 6 samples per hour my timeslots are 10 min by definition. If I get 60 samples per hour my timeslots are 1 min.
So I guess my previous statement holds true:
If I have 6 timeslots and 11 possible "groups" then I get a negative value: I have more parameters than "equations", therefore it cannot be solved. Is that correct? I would need more slots.
----------------------
"Yes, you are correct, the probability doesn't matter for this calculation.
There is a bit of a problem: if the severe bias you spoke of increases the number of flushes by a constant amount, then you cannot calculate the Pearson statistic. If this is the case, you could either assume that there is no over-dispersion, so that the constant between mean and variance is 1 and variance=mean, or, if you took enough samples, you could find what the added constant is: if you never saw fewer than 4 flushes then the constant might be 4 (and you can calculate the chance that it is less than 4)."
That was bad news, I think! If I find the constant, could I then subtract it and do the Pearson X^2?
For a Poisson, where does over-dispersion really come from: finding more high values weighting the tail than in the population? I feel it would be the opposite. Or is it due to noise that, once squared, gets a higher weight, since it is usually not zero-centered?
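One common source of over-dispersion, though not necessarily the one at work in this data, is a rate that changes from timeslot to timeslot: if the rate itself is random, the variance of the counts is the average rate plus the variance of the rate. The simulation below is a sketch with made-up rates (quiet slots at 0.5, busy slots at 3.5) compared against a constant rate of 2.0.

```python
import math
import random
from statistics import mean, variance

def poisson_sample(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(0)
n = 5000

# Same rate in every slot: variance / mean should be close to 1.
constant = [poisson_sample(2.0, rng) for _ in range(n)]

# Rate differs between slots but averages 2.0: variance / mean exceeds 1.
mixed = [poisson_sample(rng.choice([0.5, 3.5]), rng) for _ in range(n)]

ratio_constant = variance(constant) / mean(constant)
ratio_mixed = variance(mixed) / mean(mixed)
print(ratio_constant, ratio_mixed)
```

In this setup the theoretical ratio for the mixed case is (2.0 + 2.25) / 2.0, a little above 2, even though every individual slot is exactly Poisson.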
---------------------------------
"I think the rest of your questions all come from the mix-up about what n is."
No, I'm not sure about that.
I would very much appreciate it if you looked again, now that number of timeslots = number of samples.
..if I have 6 timeslots and 11 possible "groups" then I get a negative value: I have more parameters than "equations", therefore it cannot be solved. Is that correct? I need more slots.
1) How many would be preferable, enough for a decent estimate? Is there any rule of thumb? If I get 60 timeslots (10 persons) then I should divide Chi2 11 by n-p = 60-11 = 49.
1b) What if I have a lot of zeroes? (As I could and would have in some cases for a binomial.) Do I really have proportionally more information about my distribution, as the division by n-p would suggest? Or does the Chi2 11 calculation fully compensate for this? Say I have 10 people flushing during 60 min. At least 50 of these slots would be zeroes.
2) If I look up Chi2 11 in a table, how do I know what value to use? Solved, thanks again!
3) How would I use this Pearson value? If Pearson were e.g. 1.2, should I divide my estimate of the mean (obtained from the variance) by 1.2 to get a "truer" value of the mean?
4) Let's say I have 60 timeslots each night and 30 nights in a month. Should I compute this Pearson value for each night (it will be somewhat different every night) and correct each variance-based estimate of the mean with it? That would be 30 Pearson calculations over a month.
Thanks again for your input Shakarri!