Where I work, we have a database job that does some accounting for us. Because of the way the program is written, we suspect that as we get further into the year the job will slow down, then revert to normal speed at the start of the new year.
We have about 200 observations. When I started looking at the data, the first thing that jumped out at me was that the more work there was to do, the faster it went. So I made a graph of the number of records processed in a single execution vs. the rate at which records were processed. There is a clear linear correlation between these two variables, and I can easily use least squares to find the line that best fits the data.
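To make it concrete, here is roughly what that fit looks like in Python with numpy. The numbers below are made up; the real input would be our ~200 (batch size, rate) observations:

```python
import numpy as np

# Made-up observations: batch size and the rate the job ran at.
batch_size = np.array([100.0, 500.0, 1000.0, 2000.0, 4000.0])
rate = np.array([12.0, 30.0, 55.0, 98.0, 190.0])  # e.g. records/sec

# Degree-1 polyfit gives the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(batch_size, rate, 1)

# The rate the line predicts for each batch size.
predicted = slope * batch_size + intercept
```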
My problem is that I need to compare the program executions based on when during the year they were run, not on the amount of work to be processed at once.
What I would like to do is compute the least-squares line, use it to normalize the observations, and then make a new graph that plots the number of days into the year versus the normalized speed. That should reveal any slowdown.
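Here is the normalization step I have in mind, sketched with made-up numbers (the variable names N, t, R are just my shorthand for batch size, day of year, and observed rate):

```python
import numpy as np

# Made-up runs: batch size N, day of year t, observed rate R.
N = np.array([100.0, 500.0, 1000.0, 2000.0])
t = np.array([15.0, 90.0, 200.0, 330.0])
R = np.array([14.0, 28.0, 60.0, 95.0])

# Fit the rate against batch size alone.
slope, intercept = np.polyfit(N, R, 1)
expected_R = slope * N + intercept  # what the line predicts for each batch size

# Normalized speed: 1.0 means "exactly on the line",
# below 1.0 means slower than the batch size alone would suggest.
normalized = R / expected_R

# The new graph would then plot t on the x-axis vs. normalized on the y-axis.
```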
My question is: Is this a valid thing to do? Can I produce an "apples to apples" comparison by factoring the batch size out of the observations?
A co-worker says I can't do this, because if there is an effect of the calendar date then it is already distorting my first graph. I think that since batch size and calendar date are independent variables, I can do this and get a meaningful result.
Is this clear? Can anyone tell me what this kind of analysis is called?
I only think the two variables are independent because they are both chosen arbitrarily by our operations people, but the truth is there might be some subtle relationship, however unintentional.
I will look into multi-variable linear regression. Thanks for the reply.
What if I back up and first try to verify that the day of the year and the number of records are in fact independent? Is it enough to show that they are not correlated?
If that were true, would I then be justified in normalizing the data in my first graph the way I described above?
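The correlation check itself is easy to sketch (again with made-up data). One caveat I've since read: a correlation near zero only rules out a *linear* relationship, which is weaker than true independence:

```python
import numpy as np

# Made-up data: day of year and batch size for each run.
t = np.array([15.0, 90.0, 150.0, 200.0, 280.0, 330.0])
N = np.array([800.0, 300.0, 1200.0, 500.0, 950.0, 400.0])

# Pearson correlation coefficient between the two variables.
# Near 0 suggests no linear relationship; it does NOT prove independence.
r = np.corrcoef(t, N)[0, 1]
```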
I'm a novice in statistics so let me make sure I understand what you are saying.
Conceptually, I will be plotting my points in 3-D space. The x-axis can be the batch size, the y-axis the days into the year, and the z-axis the rate at which the jobs ran. This will look like a cloud of points floating above the first quadrant. Then my task, with a linear regression in two variables, is to draw a plane through the points that minimizes the sum of the squared vertical (z) distances from the points to the plane.
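If I understand it right, the plane fit can be written as solving R ≈ k0 + k1*t + k2*N by least squares. A sketch with made-up numbers (the coefficient names k0, k1, k2 are just my labels):

```python
import numpy as np

# Made-up runs: batch size N, day of year t, observed rate R.
N = np.array([100.0, 500.0, 1000.0, 2000.0, 800.0, 1500.0])
t = np.array([30.0, 60.0, 120.0, 200.0, 280.0, 340.0])
R = np.array([15.0, 30.0, 58.0, 100.0, 40.0, 70.0])

# Design matrix: a column of ones for the intercept, then t, then N,
# so the model is R ≈ k0 + k1*t + k2*N.
A = np.column_stack([np.ones_like(N), t, N])
(k0, k1, k2), *_ = np.linalg.lstsq(A, R, rcond=None)

# k1 is the per-day effect holding batch size fixed;
# k2 is the per-record effect holding the date fixed.
```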
I can look up the details on that. So will I then be able to see whether jobs run later in the year take longer, even controlling for batch size? Will that be clear from the slope of the plane in one of the directions? Should I scale the ranges of the two variables to produce a square domain?
Also, you said that plotting (response corrected for batch size) against t should give me a plot showing what I want.
I think I see what you mean. In my original post I was actually thinking of the non-linear version you mention, where the variables are multiplied together. I say this because I wanted to divide the response R by the value of the regression line fitted against batch size alone (I think you called it k2*N). So instead of plotting R - k2*N vs. t, I wanted to plot R / (k2*N) vs. t.
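Side by side, the two corrections look like this (made-up numbers; k2 is a stand-in for the slope from a batch-size-only fit). The subtraction matches an additive model R = k2*N + f(t), while the ratio matches a multiplicative model R = k2*N * g(t):

```python
import numpy as np

# Made-up data: batch size N and observed rate R.
N = np.array([100.0, 500.0, 1000.0, 2000.0])
R = np.array([14.0, 28.0, 60.0, 95.0])
k2 = 0.05  # stand-in slope from a batch-size-only regression

additive = R - k2 * N          # residual: how far above/below the line
multiplicative = R / (k2 * N)  # ratio: what fraction of the expected rate
```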
Do you think that is valid as well? Is there anything you can point me to on the difference between linear and non-linear regressions?