Consider this situation. There is an exam whose pass rate appears to have a linear relationship with the age of the exam taker: the older the test taker, the higher the pass rate. I'm not interested in the exact exam scores, only in pass or fail.
I randomly sample 5,000 exam takers and record each one's age and whether they passed or failed. Then, for each age (35, 40, 53, and so on), I compute the percentage of exam takers of that age who passed and plot that pass rate against age on a graph that looks something like this:
One notable fact about this survey is that not all ages have the same number of exam takers. For example, there could be 300 35-year-olds but only 15 65-year-olds. Either way, I plot the pass rate for every age.
Let's say I am interested in calculating a linear regression line like the one shown in the graph. Which of the following methods will give the best regression line?
Method 1: Establish a minimum sample size threshold and exclude all data points that do not meet it. For example, exclude data points based on fewer than 50 exam takers. If there were only 15 65-year-olds, that data point is excluded because there is a high chance that its pass rate is inaccurate. All remaining data points are weighted equally and a regression line is calculated.
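To make Method 1 concrete, here is a minimal sketch in Python with NumPy. The counts are made up for illustration, and the function name `fit_with_threshold` and its `min_sample` parameter are my own hypothetical choices, not part of any library.

```python
import numpy as np

def fit_with_threshold(ages, n_takers, n_passed, min_sample=50):
    """Method 1 sketch: drop ages with too few exam takers, then fit an
    ordinary (equally weighted) least-squares line to the pass rates."""
    ages = np.asarray(ages, dtype=float)
    n_takers = np.asarray(n_takers, dtype=float)
    pass_rate = np.asarray(n_passed, dtype=float) / n_takers

    # Keep only the ages that meet the minimum sample size.
    keep = n_takers >= min_sample

    # Degree-1 polynomial fit; np.polyfit returns [slope, intercept].
    slope, intercept = np.polyfit(ages[keep], pass_rate[keep], deg=1)
    return slope, intercept, keep

# Made-up summary data: one entry per age.
ages = [25, 35, 45, 55, 65]
n_takers = [200, 300, 120, 60, 15]
n_passed = [80, 150, 72, 42, 14]

slope, intercept, keep = fit_with_threshold(ages, n_takers, n_passed)
print("kept ages:", np.asarray(ages)[keep])
print("slope:", slope, "intercept:", intercept)
```

With these made-up numbers the 65-year-old point (15 exam takers) is dropped and the other four points count equally, regardless of whether they are based on 60 or 300 exam takers.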
Method 2: Include all data points but weight each one according to the sample size it is based on. If there are 300 35-year-olds and 15 65-year-olds, the 35-year-old data point is weighted 20 times as heavily as the 65-year-old data point. Calculate a regression line using all the weighted data points.
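Method 2 could be sketched like this, again in Python with NumPy on made-up counts. The function name `fit_weighted` is hypothetical. One detail I am assuming matters here: `np.polyfit` applies its `w` argument to the unsquared residuals, so to weight each squared residual by the sample size I pass the square root of the sample size.

```python
import numpy as np

def fit_weighted(ages, n_takers, n_passed):
    """Method 2 sketch: keep every age, but weight each pass rate by the
    number of exam takers it is based on."""
    ages = np.asarray(ages, dtype=float)
    n_takers = np.asarray(n_takers, dtype=float)
    pass_rate = np.asarray(n_passed, dtype=float) / n_takers

    # np.polyfit minimizes sum((w * residual)**2), so w = sqrt(n) gives
    # each squared residual a weight of n.
    slope, intercept = np.polyfit(ages, pass_rate, deg=1, w=np.sqrt(n_takers))
    return slope, intercept

# Made-up summary data: one entry per age.
ages = [25, 35, 45, 55, 65]
n_takers = [200, 300, 120, 60, 15]
n_passed = [80, 150, 72, 42, 14]

slope, intercept = fit_weighted(ages, n_takers, n_passed)
print("slope:", slope, "intercept:", intercept)
```

Here the 65-year-old point stays in the fit, but its squared residual counts 20 times less than that of the 35-year-old point (15 versus 300).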
Which method is better?