(Statistical newbie here - learning some basic concepts)
Some background (you can probably skip to the 'important bit', but just in case...): I am calculating the mean number of stratospheric sudden warmings (SSWs) per year. An SSW is a winter atmospheric event which happens roughly 6 times a decade. We don't have many of them on record. The dataset I have has 29 SSWs over 45 years. This gives a mean of 0.64 per year. For a given winter, there are zero events about half the time, one event about 40% of the time and two events about 10% of the time. But there is a lot of variability in this. For example, there were no SSWs at all between 1990 and 1998!
This means the sampling error on the estimate of the mean is quite large. I can calculate the standard error simply enough. I have a table going from the year 1958 to 2002, with the number of events in each year. A typical string of the number of events per winter is: 2 1 0 2 1 1 2 0 1 0 0 0 0 0 1. I calculate the standard error of the mean by calculating the standard deviation and dividing it by the square root of the number of independent observations (the number of years).
So far so good.
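To make the calculation concrete, here is a minimal sketch in Python (my choice of language for illustration). It uses only the 15-winter string quoted above, not the full 1958-2002 table, and it assumes the sample standard deviation (n - 1 in the denominator). The function name `standard_error` is just something I made up for this example.

```python
import math
import statistics

def standard_error(counts):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    n = len(counts)
    # statistics.stdev uses the n - 1 (sample) denominator
    return statistics.stdev(counts) / math.sqrt(n)

# The 15-winter excerpt quoted in the text (not the full 45-year record)
events_per_winter = [2, 1, 0, 2, 1, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1]

mean_events = statistics.mean(events_per_winter)
se_events = standard_error(events_per_winter)

print(f"mean = {mean_events:.3f}")  # mean = 0.733
print(f"SE   = {se_events:.3f}")    # SE   = 0.206
```

For this short excerpt the mean is about 0.73 events per winter with a standard error of about 0.21; the numbers for the full record will differ.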
* Important bit *: I understand how the standard error (SE) of the mean relates to the confidence interval, so that the 95% interval is approx. +/- 2*SE. All well and good. But depending on the dataset I use, I can obtain a standard error where this confidence interval crosses zero. I interpret this as an indication that the sampling error is so large we can't even be sure whether the true mean is positive!
The problem is, this is physical nonsense. We can't have a negative number of events per year. Realistically (for physical reasons that aren't too important) we will probably only ever see 0, 1, 2 or 3 events in each year.
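For concreteness, here is a rough Python sketch of the interval I am describing, mean +/- 2*SE. The first series is the 15-winter excerpt from my table. The second is a made-up sparse series (one event in 15 winters), included only to show how the lower bound can go negative; it is not real data. The function name `approx_95_interval` is hypothetical.

```python
import math
import statistics

def approx_95_interval(counts):
    """Approximate 95% interval for the mean: mean +/- 2 * SE."""
    m = statistics.mean(counts)
    se = statistics.stdev(counts) / math.sqrt(len(counts))  # sample SD / sqrt(n)
    return m - 2 * se, m + 2 * se

# Excerpt from the real record (15 winters)
excerpt = [2, 1, 0, 2, 1, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1]
# Hypothetical sparse series: a single event in 15 winters (illustration only)
sparse = [0] * 14 + [1]

excerpt_lo, excerpt_hi = approx_95_interval(excerpt)
sparse_lo, sparse_hi = approx_95_interval(sparse)

print(f"excerpt: [{excerpt_lo:.3f}, {excerpt_hi:.3f}]  crosses zero: {excerpt_lo < 0}")
print(f"sparse:  [{sparse_lo:.3f}, {sparse_hi:.3f}]  crosses zero: {sparse_lo < 0}")
```

With the excerpt the interval stays above zero, but with the sparse series the lower bound comes out negative, which is exactly the situation I am asking about.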
Even though the estimate of the mean can be assumed to be normally distributed, we have an additional piece of information: that it must be positive. How do I work this into my confidence interval? Otherwise I have error bars crossing zero, which makes no physical sense. Any advice would be much appreciated.
Attached plots to help:
ssws - this just shows the raw data, i.e. the number of SSW events in each year.
ssws2 - this shows the number of SSWs occurring in the period 1958-2002 split by month. I basically want to put a confidence interval on this bar chart.
ssws3 - this shows the frequency with which we observe 0, 1 and 2 SSW events in a given year. I know it's only 3 data points, but it could be said to resemble a positive-only normal distribution.