# Thread: Discerning the validity of outliers in a data set.

1. ## Discerning the validity of outliers in a data set.

Hi all,

I hope this is the right forum; it is certainly my best guess.

My problem is with a data set (I may come here with a lot of questions based on the research I am trying to do, so I will say this is the first).

Say I have a data set of 30 values. The first 28 values (for simplicity; the order in which they were received is random) all fall between 10 and 14, and the last two values are greater than 200. I know these last two values are mistakes, and I think anyone handed this data set could see that they look very off, but how do I prove it mathematically? Furthermore, how do I get an average of the values while putting as little weight as possible on the extreme outliers?

Thank You muchly.

2. Outliers are values that fall outside a set of "fences" computed from the data. Showing that a value lies outside those fences is how you would prove mathematically that it is an outlier.

If you know they are mistakes, discard them and then calculate your average. Find both the median and the mean: which one is the better indicator?
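To make the mean-versus-median comparison concrete, here is a minimal sketch. The data are invented for illustration (28 values cycling through 10 to 14, plus two made-up mistakes of 205 and 230); they are not the original poster's numbers.

```python
import statistics

# Hypothetical data: 28 plausible readings between 10 and 14 ...
good = [10.0 + (i % 5) for i in range(28)]
# ... plus two invented "mistakes" above 200.
data = good + [205.0, 230.0]

mean_all = statistics.mean(data)      # dragged upward by the two mistakes
median_all = statistics.median(data)  # barely affected by them
mean_clean = statistics.mean(good)    # mean after discarding the mistakes

print(f"mean with mistakes:    {mean_all:.2f}")
print(f"median with mistakes:  {median_all:.2f}")
print(f"mean without mistakes: {mean_clean:.2f}")
```

On data like this the median stays inside the 10 to 14 range while the mean with the mistakes included does not, which is why the median is often the safer summary when a few values are suspect.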

3. In reply to the original post by ballofpopculture:

One very simple quantitative way of doing this is to calculate the quartiles Q1, Q2 and Q3 and then the inter-quartile range (IQR), IQR = Q3 - Q1. Outliers are data points that lie outside the interval [Q1 - 1.5 IQR, Q3 + 1.5 IQR].
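A rough sketch of that rule in Python follows. The function names `iqr_fences` and `split_outliers` and the sample data are made up for illustration. Note that quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles`, so the fences may shift slightly under another convention.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different cut points.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, k=1.5):
    """Split values into those inside the fences and those outside."""
    lower, upper = iqr_fences(values, k)
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers

# Hypothetical data: 28 values between 10 and 14, plus two invented mistakes.
data = [10.0 + (i % 5) for i in range(28)] + [205.0, 230.0]

inliers, outliers = split_outliers(data)
print("fences:", iqr_fences(data))
print("outliers:", outliers)
print("mean of remaining values:", statistics.mean(inliers))
```

Because the quartiles depend only on the middle of the data, the two extreme values do not move the fences, so the rule still works when the mistakes are very large.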

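Other approaches are possible using the mean and standard deviation (sd), for example flagging values that lie more than a chosen number of sd from the mean.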
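As one possible reading of the mean-and-sd idea, the sketch below flags values more than `k` sample standard deviations from the mean. The function name `sd_outliers`, the threshold `k=3` and the data are assumptions for illustration, not something specified in this thread.

```python
import statistics

def sd_outliers(values, k=3.0):
    """Return values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) > k * sd]

# Hypothetical data: 28 values between 10 and 14, plus two invented mistakes.
data = [10.0 + (i % 5) for i in range(28)] + [205.0, 230.0]

flagged = sd_outliers(data)
print("flagged:", flagged)
```

One caveat with this approach: the suspect values themselves inflate both the mean and the sd, so with several large outliers or a stricter threshold some of them can slip through. Quartile-based rules are less affected by this.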

4. I think you both have given me enough to go on, but I'll add:

-The data set was an example of what I've seen while collecting data, but the amount of data is so great that I can't scan each set for outliers by eye (I'm using some Perl scripts to do it). In the actual data, the gap between "acceptable" and "unacceptable" values is sometimes much smaller than the gap between 10 to 14 and 200.

-When I say outliers here, I mean it strictly in the sense that certain values seem to fall outside the rest of the data set. I have not calculated that they are outliers, only assumed so (as in the example above).

Thank you both.