Math Help - A Better Linear Regression

  1. #1
    Newbie
    Joined
    Dec 2009
    Posts
    2

    A Better Linear Regression

    I am trying to recover y = 3x by regression from data that contains one bad point:
    x = 0, 2, 4, 6, 8; y = 0, 6, 12, 18, 8

    The linear regression is y = 1.4x + 3.2, and it properly minimizes the least-squares error, but I would like to find a regression that better fits the real data and gives less emphasis to the bad data (x = 8). In the real world it is not known which data point is bad, so weighting is not an acceptable option.
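    As a quick check of the numbers above, here is a minimal sketch of the ordinary least-squares fit in plain Python. The function name ols_fit is only illustrative; the formulas are the standard closed-form slope and intercept.

```python
# Ordinary least-squares fit of y = intercept + slope * x for the sample data.
xs = [0, 2, 4, 6, 8]
ys = [0, 6, 12, 18, 8]

def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = ols_fit(xs, ys)
# Residual = observed y minus fitted y at each x.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(intercept, slope)
print(residuals)
```

    The fit reproduces y = 1.4x + 3.2. Note that the residual at the good point x = 6 (about +6.4) is as large in magnitude as the residual at the bad point x = 8 (about -6.4), so the least-squares residuals alone do not single out the bad point.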

    If I look at the area between the curves, that area appears to be minimized by fitting through the correct data, which is the answer I am looking for.

    Can anyone refer me to regression equations (a derivation would be great) that minimize the area between a linear fit and the experimental data (x independent)? I want to learn how to do this, so a MATLAB solution does not help.

    Thanks for your time!

  2. #2
    Banned
    Joined
    Sep 2009
    Posts
    502
    Originally posted by TheWizard:
    I am trying to recover y = 3x by regression from data that contains one bad point:
    x = 0, 2, 4, 6, 8; y = 0, 6, 12, 18, 8

    The linear regression is y = 1.4x + 3.2, and it properly minimizes the least-squares error, but I would like to find a regression that better fits the real data and gives less emphasis to the bad data (x = 8). In the real world it is not known which data point is bad, so weighting is not an acceptable option.

    If I look at the area between the curves, that area appears to be minimized by fitting through the correct data, which is the answer I am looking for.

    Can anyone refer me to regression equations (a derivation would be great) that minimize the area between a linear fit and the experimental data (x independent)? I want to learn how to do this, so a MATLAB solution does not help.

    Thanks for your time!

    In real life, no one regresses from a given equation and ends up with exactly one bad number. If you do end up with a bad number and force the fit to accommodate it, the fit will give nothing but false information.

    Suppose the numbers you obtained from an experiment are x = 0, 2, 4, 6, 8 and y = 0, 6, 12, 18, 8, and you know they are as close as anticipated under your hypothesis. Then you should already have a good feel for which equation to fit.

    Suppose instead that you ran the experiment with no clue about the outcome and got x = 0, 2, 4, 6, 8 and y = 0, 6, 12, 18, 8. Then you must first plot your scatter diagram. If the diagram looks close to a straight line, use a straight-line equation. To confirm the goodness of fit, find the coefficient of correlation. If the coefficient of correlation is too poor, go to the next degree and fit a quadratic curve. If the quadratic curve's coefficient of correlation is still poor, go up another degree to the cubic curve. Continuing in this way, you can go all the way up to something as exotic as a logistic curve.
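    The procedure described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it uses the coefficient of determination R^2 as the goodness-of-fit measure (for a straight line this equals the squared correlation coefficient), and the helper names polyfit, predict and r_squared are hypothetical, written here with a small normal-equations solver rather than taken from any library.

```python
xs = [0, 2, 4, 6, 8]
ys = [0, 6, 12, 18, 8]

def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the monomial basis.
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))]
           for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def r_squared(xs, ys, coeffs):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot

# Step up through straight line, quadratic and cubic fits.
r2_by_degree = {d: r_squared(xs, ys, polyfit(xs, ys, d)) for d in (1, 2, 3)}
for degree, r2 in r2_by_degree.items():
    print(f"degree {degree}: R^2 = {r2:.4f}")
```

    Because each lower-degree polynomial is a special case of the next one, R^2 can only stay the same or rise as the degree goes up. A higher value therefore does not by itself show that the extra curvature reflects the underlying trend rather than the point at x = 8.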

  3. #3
    Newbie
    Joined
    Dec 2009
    Posts
    2

    Reply to Novice

    Thank you for thinking about my problem. Let me give you some additional background.

    The real-world example is fitting log(oil production rate) versus time. This is not a controlled experiment. Most of the data may be considered similar to a controlled experiment because the operating conditions will be similar. Occasionally the operating conditions change for a period of time, creating data that does not fit the normal trend. For this reason, it is necessary to detect this "bad data" and exclude it from the fit. In most cases, the data is analysed without knowledge of the daily operations. This means it is known that "bad data" may exist, but the specific data points are not identified. The desired solution is an automatic fit without user intervention.

    The example provided is contrived to emphasize the problem. Your comments regarding the search for a higher-order solution are understood, but theory restricts us to a linear (transformed) solution.

    If one considers the error to be the volume of oil produced (integral of rate versus time), then one would choose to minimize the area between the best fit line and a line connecting the data. I have done this manually for the sample data and find that this revised error minimization results in a best fit line that passes through all of the good data, but not the "bad data". This is an exceptionally favorable outcome because it automatically detects and removes the bad data from the fit.
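    The manual comparison described above can be reproduced with a short sketch. It assumes that "a line connecting the data" means the piecewise-linear curve through consecutive data points, and the function name area_between is only illustrative. On each interval the gap between the candidate line and the data curve is itself linear, so the enclosed area is a trapezoid when the gap keeps one sign and a pair of triangles when the two curves cross.

```python
xs = [0, 2, 4, 6, 8]
ys = [0, 6, 12, 18, 8]

def area_between(xs, ys, intercept, slope):
    """Area between y = intercept + slope*x and the piecewise-linear data curve."""
    total = 0.0
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        h = x1 - x0
        d0 = intercept + slope * x0 - y0  # gap at the left end of the interval
        d1 = intercept + slope * x1 - y1  # gap at the right end of the interval
        if d0 * d1 >= 0:
            # No crossing inside the interval: trapezoid.
            total += 0.5 * h * (abs(d0) + abs(d1))
        else:
            # Curves cross inside the interval: two triangles.
            total += 0.5 * h * (d0 * d0 + d1 * d1) / (abs(d0) + abs(d1))
    return total

area_true_line = area_between(xs, ys, 0.0, 3.0)  # candidate y = 3x
area_ols_line = area_between(xs, ys, 3.2, 1.4)   # least-squares line y = 1.4x + 3.2
print(area_true_line, area_ols_line)
```

    For the sample data this gives an area of 16 for y = 3x against roughly 22.4 for the least-squares line, which is consistent with the area criterion preferring the line through the good data.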

    The problem is that I cannot figure out a way to formulate the area minimization. One could argue for integrating the theoretical formulation to plot rate versus cumulative production, but then both variables are dependent, and a bad data point will influence the rest of the data.
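    One possible way to write the area objective down, offered as a sketch rather than a settled answer: assume the "line connecting the data" is the piecewise-linear interpolant p(x) through the points (x_i, y_i), i = 1..n, and let a and b be the intercept and slope of the fitted line. The symbols A, d_i and h_i are introduced here for illustration only.

```latex
% Assumption: p(x) is the piecewise-linear interpolant through (x_i, y_i), i = 1..n.
A(a,b) = \int_{x_1}^{x_n} \lvert a + b x - p(x) \rvert \, dx
       = \sum_{i=1}^{n-1} A_i,
\qquad d_i = a + b x_i - y_i, \quad h_i = x_{i+1} - x_i

% On each interval the gap is linear, so the area is a trapezoid or two triangles.
A_i =
\begin{cases}
\dfrac{h_i}{2}\bigl(\lvert d_i \rvert + \lvert d_{i+1} \rvert\bigr),
  & d_i \, d_{i+1} \ge 0 \\[2ex]
\dfrac{h_i}{2}\,\dfrac{d_i^2 + d_{i+1}^2}{\lvert d_i \rvert + \lvert d_{i+1} \rvert},
  & d_i \, d_{i+1} < 0
\end{cases}
```

    Written this way, A(a, b) is convex in (a, b), because it is an integral of absolute values of functions that are affine in a and b. It is not differentiable wherever some d_i = 0, so it does not reduce to a pair of linear normal equations the way least squares does, and the minimum would have to be found numerically.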
