# Thread: Can it be justified to remove anomalous data points to give a better representation?

1. ## Can it be justified to remove anomalous data points to give a better representation?

Hey!

OK, so we carried out an experiment called the oral glucose tolerance test (OGTT), in which we had to fast for more than 6 hours and then drink a large glucose load (50 g). We then used a prick test to measure the blood glucose concentration every 30 minutes for 2 hours. Generally speaking, most people's blood glucose rose and peaked at around 30 to 60 minutes. I THEN measured the rate of increase (between the concentration at 0 minutes and the maximum concentration) and the rate of decrease (between the maximum concentration and the concentration 30 minutes after the maximum) and produced the table below:

Code:
rate of increase	rate of decrease
0.07222223	0.02666667
0.09333333	0.090000
0.2466667	0.040000
0.1416667	0.003333333
0.2766667	0.020000
0.2633333	0.05666667
0.3566667	0.150000
0.210000	0.060000
0.310000	0.010000
0.290000	0.01666667
0.2433333	0.050000
0.1766667	0.1133333
0.290000	0.02333333
This graphs out to what is in the attachment. (The y-axis is rate of blood glucose decreasing)
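
For anyone wanting to reproduce the two rates, here is a minimal Python sketch of the calculation as described above. The function name `glucose_rates`, the example readings and the mmol/L unit are all assumptions made for illustration; they are not data from the experiment.

```python
def glucose_rates(times_min, conc):
    """Return (rate_of_increase, rate_of_decrease) for one person.

    rate_of_increase: (peak - baseline) / time taken to reach the peak
    rate_of_decrease: (peak - next reading) / time between those two readings
    Returns None when the peak is the first or the last reading, because
    one of the two rates cannot then be computed.
    """
    # Index of the highest reading (the first one if there is a tie)
    peak = max(range(len(conc)), key=lambda i: conc[i])
    if peak == 0 or peak == len(conc) - 1:
        return None
    rise = (conc[peak] - conc[0]) / (times_min[peak] - times_min[0])
    fall = (conc[peak] - conc[peak + 1]) / (times_min[peak + 1] - times_min[peak])
    return rise, fall


# HYPOTHETICAL readings (assumed to be mmol/L), one every 30 minutes for 2 hours
times = [0, 30, 60, 90, 120]
readings = [4.8, 7.4, 6.2, 5.5, 5.0]

rise, fall = glucose_rates(times, readings)
print(f"rate of increase = {rise:.4f}, rate of decrease = {fall:.4f}")
```

Returning `None` for a peak at the first or last reading is a design choice in this sketch; the post does not say how such cases were handled.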

As you can see, there is a nice cluster of negatively correlated values! HOWEVER, around the plot there are some pretty anomalous values.
I suspect the values to the LEFT of the plot to be a result of individuals who did not like the taste of the glucose drink that was administered and therefore didn't finish it; otherwise their livers would have to be considered super-human (a very unlikely reason!).
The extreme value to the RIGHT of the graph is anomalous but biologically explainable.

The problem that these anomalous values cause is that they significantly affect the conclusions that can be drawn from the results.
Correlation analysis of the full data set shows no significant correlation between the two variables. However, if I remove the 3 most extreme data points (those furthest from the imagined line of best fit), then the correlation is significant.
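
The comparison described above can be sketched in Python using the values from the table. The post does not say which three points were removed, so the indices in `excluded` are an assumption (the two lowest-lying points on the left and the extreme point on the right), chosen only to illustrate the effect. The helper names are also made up for this sketch.

```python
import math

# (rate of increase, rate of decrease) pairs copied from the table in the post
data = [
    (0.07222223, 0.02666667), (0.09333333, 0.090000), (0.2466667, 0.040000),
    (0.1416667, 0.003333333), (0.2766667, 0.020000), (0.2633333, 0.05666667),
    (0.3566667, 0.150000), (0.210000, 0.060000), (0.310000, 0.010000),
    (0.290000, 0.01666667), (0.2433333, 0.050000), (0.1766667, 0.1133333),
    (0.290000, 0.02333333),
]


def pearson_r(pairs):
    """Pearson product-moment correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# ASSUMPTION: zero-based row numbers of the three excluded points.
# The post does not identify them; these are an illustrative guess.
excluded = {0, 3, 6}
trimmed = [p for i, p in enumerate(data) if i not in excluded]

for label, pairs in (("all points", data), ("after exclusion", trimmed)):
    r = pearson_r(pairs)
    t = t_statistic(r, len(pairs))
    print(f"{label}: n = {len(pairs)}, r = {r:.3f}, t = {t:.2f}")
```

Each t value can be compared with the two-tailed critical value for n - 2 degrees of freedom from a t table. A different choice of excluded points will give a different result, which is exactly why the choice needs to be justified independently of the correlation it produces.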

This is for a lab report which forms a HUGE percentage of my final grade at university so it is important that I can produce some good results.

1) Is there a way to identify anomalous data?

2) Would it be scientifically acceptable to remove this data? If so, what reason could I give for removing it?
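
On question 1, one simple diagnostic (not something proposed in this thread, and not a formal outlier test) is a leave-one-out check: drop each point in turn and see how much the correlation coefficient moves. The sketch below uses the table values; the function names are made up for illustration.

```python
import math

# (rate of increase, rate of decrease) pairs copied from the table in the post
data = [
    (0.07222223, 0.02666667), (0.09333333, 0.090000), (0.2466667, 0.040000),
    (0.1416667, 0.003333333), (0.2766667, 0.020000), (0.2633333, 0.05666667),
    (0.3566667, 0.150000), (0.210000, 0.060000), (0.310000, 0.010000),
    (0.290000, 0.01666667), (0.2433333, 0.050000), (0.1766667, 0.1133333),
    (0.290000, 0.02333333),
]


def pearson_r(pairs):
    """Pearson product-moment correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def leave_one_out(pairs):
    """Return (index, r_without_point, change_in_r), largest change first."""
    r_all = pearson_r(pairs)
    rows = []
    for i in range(len(pairs)):
        r_without = pearson_r(pairs[:i] + pairs[i + 1:])
        rows.append((i, r_without, r_without - r_all))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


for index, r_without, change in leave_one_out(data):
    print(f"row {index}: r without it = {r_without:+.3f} (change {change:+.3f})")
```

A large change only shows that a point is influential. It does not show that the measurement is wrong, so it cannot by itself justify removing the point.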

NOTE: I am using Excel, SPSS and GraphPad Prism for all the statistical analysis.

2. In reply to anthmoo:

What you present here is pretty close to what you should put in your report.

Report the conclusion drawn from the raw data, explain why you think the outliers should be pruned, then present the results after pruning and say why these are to be preferred to the results from the raw data.

I doubt it is important that you produce "good results" rather than a good report. Reporting the reality is much better science than trimming the data to fit the desired conclusions. The alternative is to encourage scientific fraud.

CB

3. Thanks man! I remember you from here a few years back when I was doing my A Levels in Further Maths and Maths. You were a great help then and you still are. Thank you, you do great work here!