# Can it be justified to remove anomalous data points to give a better representation?

• Dec 12th 2010, 09:37 AM
anthmoo
Hey!

OK, so we carried out an oral glucose tolerance test (OGTT), in which we had to fast for more than 6 hours and then drink a large glucose load (50 g). We then used a prick test to measure the blood glucose concentration every 30 minutes for 2 hours. Generally speaking, most people's blood glucose rose and peaked at around 30 or 60 minutes. I then calculated the rate of increase (from the concentration at 0 min to the maximum concentration) and the rate of decrease (from the maximum concentration to the concentration 30 minutes after the maximum) and produced the table below:

Code:

```
rate of increase    rate of decrease
0.07222223          0.02666667
0.09333333          0.090000
0.2466667           0.040000
0.1416667           0.003333333
0.2766667           0.020000
0.2633333           0.05666667
0.3566667           0.150000
0.210000            0.060000
0.310000            0.010000
0.290000            0.01666667
0.2433333           0.050000
0.1766667           0.1133333
0.290000            0.02333333
```
Plotted, this gives the graph in the attachment. (The y-axis is the rate of blood glucose decrease.)
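For anyone wanting to reproduce the two rates, here is a minimal sketch of the calculation as described above the table. The function name `glucose_rates`, the example readings, and the mmol/L unit are all assumptions made for illustration; none of them come from the actual experiment.

```python
def glucose_rates(times_min, conc):
    """Return (rate_of_increase, rate_of_decrease) per minute for one subject.

    rate_of_increase: from the reading at time 0 to the peak reading.
    rate_of_decrease: from the peak reading to the next reading after it.
    """
    if len(times_min) != len(conc) or len(conc) < 3:
        raise ValueError("need matching time/concentration lists of length >= 3")

    # Index of the maximum concentration (first occurrence if tied).
    peak = max(range(len(conc)), key=lambda i: conc[i])
    if peak == 0 or peak == len(conc) - 1:
        # No rise before the peak, or no reading after it: a rate is undefined.
        raise ValueError("peak is at the first or last reading")

    rise = (conc[peak] - conc[0]) / (times_min[peak] - times_min[0])
    fall = (conc[peak] - conc[peak + 1]) / (times_min[peak + 1] - times_min[peak])
    return rise, fall


# Hypothetical readings (assumed mmol/L) every 30 minutes; NOT real data.
times = [0, 30, 60, 90, 120]
readings = [5.0, 7.4, 8.0, 6.8, 5.6]

rise, fall = glucose_rates(times, readings)
print(f"rate of increase: {rise:.4f} per min, rate of decrease: {fall:.4f} per min")
```

Note that a subject whose peak falls on the last reading has no "30 minutes after" value, so the sketch refuses to return a rate rather than guessing one.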

As you can see, there is a nice cluster of negatively correlated values! HOWEVER, around that cluster there are some pretty anomalous values.
I suspect the values to the LEFT of the plot come from individuals who did not like the taste of the glucose drink and therefore didn't finish it; otherwise their livers would have to be considered super-human (a very unlikely explanation!).
The extreme value to the RIGHT of the graph is anomalous but biologically explainable.

The problem with these anomalous values is that they significantly affect the conclusions that can be drawn from the results.
Correlation analysis of the full data set shows no significant correlation between the two variables. However, if I remove the 3 most extreme data points (those furthest from the imagined line of best fit), the correlation becomes significant.
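The full-set result can be checked outside Excel or SPSS with a short sketch like the one below. It assumes a plain Pearson correlation with the usual t statistic, since the post does not say which correlation test was used. The names `pearson_r` and `correlation_summary` and the `exclude` argument are hypothetical helpers; the post does not list which three points were dropped, so no indices are filled in.

```python
import math

# The 13 (rate of increase, rate of decrease) pairs from the table above.
RATE_UP = [0.07222223, 0.09333333, 0.2466667, 0.1416667, 0.2766667,
           0.2633333, 0.3566667, 0.210000, 0.310000, 0.290000,
           0.2433333, 0.1766667, 0.290000]
RATE_DOWN = [0.02666667, 0.090000, 0.040000, 0.003333333, 0.020000,
             0.05666667, 0.150000, 0.060000, 0.010000, 0.01666667,
             0.050000, 0.1133333, 0.02333333]


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_summary(x, y, exclude=()):
    """Return (r, t, df), optionally skipping the 0-based row indices in `exclude`."""
    skip = set(exclude)
    xs = [v for i, v in enumerate(x) if i not in skip]
    ys = [v for i, v in enumerate(y) if i not in skip]
    r = pearson_r(xs, ys)
    df = len(xs) - 2
    # t statistic for testing H0: no linear correlation.
    t = r * math.sqrt(df / (1.0 - r * r))
    return r, t, df


r, t, df = correlation_summary(RATE_UP, RATE_DOWN)
print(f"full set: r = {r:.3f}, t = {t:.3f}, df = {df}")
```

On the full table, |t| comes out well below 2.201, the two-tailed 5% critical value for 11 degrees of freedom, which agrees with the non-significant result described above. Passing row indices through `exclude` re-runs the same test on the reduced set; the critical value then has to be looked up for the smaller number of degrees of freedom.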

This is for a lab report which forms a HUGE percentage of my final grade at university so it is important that I can produce some good results.

1) Is there a way to identify anomalous data?

2) Would it be scientifically acceptable to remove this data? If so, what reason could I give for removing it?

NOTE: I am using Excel, SPSS and GraphPad Prism for all the statistical analysis.

• Dec 12th 2010, 10:53 AM
CaptainBlack