# Thread: Probability that a value belongs to a dataset

1. ## Probability that a value belongs to a dataset

G'day Math Help Forum, once again I find myself in need of your collective knowledge...

I'm utterly stuck on a problem, and have been for some time now, wherein I need to calculate the % probability that a new value belongs to a particular dataset. I only have the subject value and the values of the dataset. Obviously, the closer the value is to the mean, the greater the probability; however, the change in probability with distance from the mean is not likely to be linear (i.e. it should follow something like a bell curve).
I've attempted to work it out using z values and standard deviations, but I keep getting referred to z value tables and calculator functions that the program I'm using simply doesn't have (namely the field calculator in the attribute table of ESRI's ArcGIS).

If, for example, my dataset was:

2,5,7,6,4,5,6,4,5,6

then a new value of 5 or 6 would be significantly more probable than 1 or 9.

Does anyone have any ideas on this one? I'm sure I'm making it far more complicated than necessary, but it's got me stumped...

2. ## Re: Probability that a value belongs to a dataset

You can use prediction intervals to find this. Find the z score of the new value relative to the data set (its distance from the mean, measured in standard deviations), then convert the z score to a percentile about the mean.
If, for example, the z score was 1, that corresponds to 68% of the population lying within that distance of the mean, so the chance your data point is part of the data set is 1 - 0.68 = 32%.
Prediction interval - Wikipedia, the free encyclopedia
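
For what it's worth, the z score to percentage step does not strictly need a table if a Python interpreter with the standard `math` module is available (an assumption here; `math.erfc` exists in Python 2.7+ and 3.2+). The sketch below is only an illustration of the idea above: the function name `two_sided_tail` is made up for this example, and it assumes the data are roughly normal.

```python
import math

# Example dataset from the first post
data = [2, 5, 7, 6, 4, 5, 6, 4, 5, 6]

def two_sided_tail(value, sample):
    """Fraction of a normal population at least as far from the mean as `value`."""
    n = len(sample)
    mean = sum(sample) / float(n)
    # Sample standard deviation (n - 1 in the denominator)
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = abs(value - mean) / s
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2.0))

for v in (5, 6, 1, 9):
    print("%s: %.1f%%" % (v, 100.0 * two_sided_tail(v, data)))
```

For this dataset the mean is 5 and the sample standard deviation is about 1.41, so a new value of 6 comes out at roughly 48% and a value of 1 or 9 at roughly 0.5%.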

3. ## Re: Probability that a value belongs to a dataset

I considered that and it seems ideal; however, it requires a z score table or an equivalent function, and the field calculator doesn't have these. No one seems to be able to tell me how to calculate the standard normal properties without them.

4. ## Re: Probability that a value belongs to a dataset

I have no idea what you mean by a "field calculator". The only way to find a normal probability, if you cannot look it up in a table or use a calculator that has a "normal probability function", is to do a numerical integration: $\frac{1}{\sigma\sqrt{2\pi}}\int_a^b e^{-(x- \mu)^2/(2\sigma^2)}\, dx$.
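
As a rough sketch of what such a numerical integration could look like, here is composite Simpson's rule applied to the normal density (with the usual $1/(\sigma\sqrt{2\pi})$ normalising factor and $2\sigma^2$ in the exponent). Simpson's rule is just one choice of method, and the names `normal_pdf` and `normal_prob` are made up for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def normal_prob(a, b, mu, sigma, n=200):
    """Approximate P(a <= X <= b) with composite Simpson's rule on n subintervals."""
    if n % 2:
        n += 1  # Simpson's rule needs an even number of subintervals
    h = (b - a) / float(n)
    total = normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma)
    for i in range(1, n):
        weight = 4 if i % 2 else 2  # interior points alternate 4, 2, 4, ...
        total += weight * normal_pdf(a + i * h, mu, sigma)
    return total * h / 3.0

# Probability of lying within one standard deviation of the mean
print(normal_prob(-1.0, 1.0, 0.0, 1.0))  # close to 0.6827
```

The probability of lying outside the interval is then one minus this value.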

5. ## Re: Probability that a value belongs to a dataset

Originally Posted by Mattrnfnr
Does anyone have any ideas on this one? I'm sure I'm making it far more complicated than necessary, but it's got me stumped...
What you ask cannot be done without more information and calculation than you are likely to be able to provide. What follows is a "desperate statistical" approach that will often be adequate.

The simplest approach to your real problem (of deciding if a new value is likely to be from the same distribution as your data set) is to set a pair of control limits. Typically we would set these to be $\pm 2 s$ from the mean (where $s$ is your estimate of the standard deviation of the distribution your data set is sampled from). However, as the value $2$ is in your test data (it would fall just outside $\pm 2 s$ limits) and you have a small sample, I would go with the slightly more conservative $\pm 2.3 s$ control limits (based on the t-distribution, whose two-sided 95% critical value for 9 degrees of freedom is about 2.26, and which is often used for small-sample work even though it is not strictly applicable).

With these control limits you will reject about 5% of cases where the new value does come from the same distribution as your data set.
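
To make the control-limit check concrete, here is a minimal Python sketch using the example dataset from the first post. The function names `control_limits` and `within_limits` are made up for illustration, and the multiplier `k = 2.3` is the value suggested above, not a general rule.

```python
import math

# Example dataset from the first post
data = [2, 5, 7, 6, 4, 5, 6, 4, 5, 6]

def control_limits(sample, k=2.3):
    """Return (lower, upper) limits at mean -/+ k sample standard deviations."""
    n = len(sample)
    mean = sum(sample) / float(n)
    # Sample standard deviation (n - 1 in the denominator)
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return mean - k * s, mean + k * s

def within_limits(value, sample, k=2.3):
    """True if `value` lies inside the control limits built from `sample`."""
    low, high = control_limits(sample, k)
    return low <= value <= high

low, high = control_limits(data)
print("limits: %.2f to %.2f" % (low, high))
for v in (1, 5, 6, 9):
    print("%s: %s" % (v, within_limits(v, data)))
```

With this dataset the limits come out at roughly 1.75 and 8.25, so 5 and 6 are accepted while 1 and 9 are rejected.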