
Originally Posted by
davidwhite
I hope it is OK to post here - I am a total non-expert with only basic maths ability, but with what I think is a tricky problem to solve. I hope someone here may be able to help.
I need to estimate the value of X (unknown) when I know only the value for Y. To do this I have a list of other occurrences of X and Y, and I have to determining any pattern in the correlation of X to Y. Although X and Y are not totally random, there is no definite or consistent correlation other than there is a tendency for the value of X to be larger if Y is larger. The best that can be hoped for is as accurate a guess as possible. Here is an example:
1. X = 380 Y = 320
2. X = 840 Y = 5500
3. X = 5176 Y = 499
4. X = 5315 Y = 1074
5. X = 6484 Y = 140
6. X = 7244 Y = 5002
7. X = 10950 Y = 1104
8. X = 11040 Y = 279
9. X = 16791 Y = 3001
10. X = 17096 Y = 2788
11. X = 20700 Y = 9194
12. X = 169901 Y = 20886
13. X = 178322 Y = 94437
14. X = 184739 Y = 1198
15. X = 525257 Y = 29129
X = ? if Y = 1198
X = ? if Y = 8062
X = ? if Y = 31419
As a non-expert, at the moment my best guess at the Xs involves the following process: I divide X by Y for all known values, in the example above this gives results ranging from 0.15 (No.2) to 154.21 (No. 14). I then find the median value across the sample set, in the example above this is 6.13 (represented by No. 10). I then apply this value to Y in the case of unknown X, so in this case I would estimate the 3 unknown Xs to be 7346 and 49420 and 192599.
I have two questions.
a) My simple method gets within +-50% in 60% of cases (the actual data sets are larger than the small example above, with anything from 50 to 500 known X and Ys). Is there a more advanced approach that would give more accurate estimates?
b) If I have to use a formula for estimating 100 unknown X's, and my goal is to get, say, 75% of my estimates to within, say, +-50%. As all values are positive, the most my result can be in error on the negative side is close to -100%, but my method can easily give errors on the positive side in excess of +100%, +200%, even 1,000%. So the scope for positive error is unlimited while the scope for negative error is limited to -100%. Can this fact be harnessed to improve the statistical accuracy and get more results within my target tolerance range?
For practical purposes any suggestions or solutions ideally need to be executable on Excel.
Thank you in advance for your patience with this problem and any help.