Hi!
Given a binomial distribution where p and n are known, does it make sense to construct a confidence interval around the expected mean?
I guess my question is more about confidence intervals in general: usually we have a sample and estimate a population parameter from it. A different sample might lead to a different estimate, so we use a confidence interval (and level) to communicate that uncertainty.
In my case, however, I do not have a sample to estimate the mean from. I have concrete distribution parameters (p and n) and I "predict" that (intuitively) in most cases there will be about p*n successful outcomes.
I am not sure how to apply confidence intervals here. Would my "confidence interval" simply be the central region of the distribution that contains probability mass equal to the confidence level?
Please note that I am *not* estimating the value of 'p' from a sample (that is the topic that most likely comes up if you search for "binomial distribution confidence interval" on the web).
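To make the question concrete, here is a small sketch of what I have in mind (the values n=100, p=0.3 are just made-up examples): take the equal-tailed central range of the known binomial, cutting off half of (1 - confidence) in each tail.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success prob p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def central_interval(n, p, conf=0.95):
    """Equal-tailed range [lo, hi] with P(lo <= K <= hi) >= conf,
    leaving at most (1 - conf)/2 probability in each tail."""
    tail = (1 - conf) / 2
    # lower bound: smallest k whose CDF exceeds the lower tail
    cdf, lo = 0.0, 0
    for k in range(n + 1):
        cdf += binom_pmf(k, n, p)
        if cdf > tail:
            lo = k
            break
    # upper bound: largest k whose upper-tail sum exceeds the upper tail
    cdf, hi = 0.0, n
    for k in range(n, -1, -1):
        cdf += binom_pmf(k, n, p)
        if cdf > tail:
            hi = k
            break
    return lo, hi

lo, hi = central_interval(100, 0.3, 0.95)
print(lo, hi)  # an interval around the mean n*p = 30
```

Is this the right way to think about it, or does the term "confidence interval" simply not apply when the parameters are known?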
Thanks!
Ok, here is what I'm dealing with.
I've derived a model to explain some real data; the model is not a least-squares fit or anything like that, it's purely theoretical. At the end of the day, my model function M(x) is meant to "predict" the value of a random variable X. M(x) happens to be the expected value of a binomial distribution with parameters p(x) and n(x), i.e.
M(x) = p(x) * n(x)
(note that for each x the binomial distribution parameters may be different -- p and n are functions)
Now, for regression models it's expected that a prediction comes with a (possibly wide) prediction interval, which tells the consumer of the predicted data what my level of certainty in the numbers is.
In my case I'm not sure how to attach that kind of uncertainty information to the prediction.
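What I could do, I suppose, is report the central binomial range at each x as the "prediction interval", since the full distribution at each point is known. A sketch (the p(x) and n(x) below are stand-ins, my real functions are of course different):

```python
from scipy.stats import binom

# Hypothetical stand-ins for the model's p(x) and n(x).
def p(x):
    return 0.5 / x

def n(x):
    return 200 * x

for x in range(1, 6):
    mean = p(x) * n(x)  # the point prediction M(x)
    # equal-tailed range containing 95% of the probability mass
    lo, hi = binom.interval(0.95, n(x), p(x))
    print(f"x={x}: M(x)={mean:.1f}, 95% range [{lo:.0f}, {hi:.0f}]")
```

Would such a range be a legitimate thing to report alongside M(x)?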
...
As background information, here is what I'm really doing. I'm given a large set of data (a large corpus of text, actually). From this data we can observe three functions, say D(x), E(x) and F(x). My theory is that if D(x) and E(x) are known, F(x) can be constructed. The model function M(x) I'm talking about above is the theoretical F(x). M(x) is constructed from the values of D(x) and E(x) plus some extra parameters; thus I'm in fact able to create a class of M(x) functions, and my ultimate goal is to judge which one fits the real data (F(x)) most closely.
I *think* that the chi-square goodness-of-fit test is not appropriate here, for the following reasons (am I wrong?):
* I'm not testing whether a sample fits a distribution: F(x) is not a distribution function.
* Using a different corpus would yield completely different F(x) and M(x), so I can't, say, take N corpora, build M(x) for each of them, and use those M(x) values as sample values to see if they fit the "distribution" defined by F(x).
...
Yet more details. :-)
Say x = 1..5 for a particular corpus.
We then have 5 values of F(x) and 5 values of M(x). For each datapoint x, the model says the observed value is binomially distributed with parameters p_x and n_x; its mean is M(x) = p_x * n_x. Now we need to check whether the actual value F(x) is consistent with this theoretical distribution *for this value of x*.
So for each point x, I would test whether a sample of real values (i.e. F(x)) matches my theoretical distribution at that very point. The problem is that I only have 1 value of F(x) per point, so it doesn't make sense to speak of a sample distribution. That's one problem.
The other problem is how to use all 5 datapoints together in the analysis.
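One direction I've been considering (I'm not sure it's sound, which is partly why I'm asking): since the distribution at each x is fully specified, compute an exact two-sided p-value for the single observation F(x) at each point, and then combine the 5 p-values with Fisher's method. Sketch below; the observed counts and (n_x, p_x) pairs are made up:

```python
from scipy.stats import binom, combine_pvalues

def exact_two_sided_p(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes
    no more likely than the observed count k under Binomial(n, p)."""
    pmf = binom.pmf(range(n + 1), n, p)
    return pmf[pmf <= pmf[k] * (1 + 1e-9)].sum()

# Hypothetical observed F(x) and model parameters (n_x, p_x) for x = 1..5.
obs = [48, 95, 160, 210, 230]
params = [(100, 0.5), (200, 0.5), (300, 0.5), (400, 0.5), (450, 0.5)]

pvals = [exact_two_sided_p(k, n, p) for k, (n, p) in zip(obs, params)]
# Fisher's method aggregates independent per-point p-values into one.
stat, combined = combine_pvalues(pvals, method='fisher')
print(pvals, combined)
```

That would give one overall number per candidate M(x), which I could then use to rank the candidates. Does that make sense, or is there a more standard way to handle a single observation per point?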
Hopefully that makes the issue a bit clearer.