Can I improve the accuracy of this method?

I hope it is OK to post here - I am a total non-expert with only basic maths ability, but with what I think is a tricky problem to solve. I hope someone here may be able to help.

I need to estimate the value of X (unknown) when I know only the value for Y. To do this I have a list of other occurrences of X and Y, and I have to determining any pattern in the correlation of X to Y. Although X and Y are not totally random, there is no definite or consistent correlation other than there is a tendency for the value of X to be larger if Y is larger. The best that can be hoped for is as accurate a guess as possible. Here is an example:

1. X = 380 Y = 320

2. X = 840 Y = 5500

3. X = 5176 Y = 499

4. X = 5315 Y = 1074

5. X = 6484 Y = 140

6. X = 7244 Y = 5002

7. X = 10950 Y = 1104

8. X = 11040 Y = 279

9. X = 16791 Y = 3001

10. X = 17096 Y = 2788

11. X = 20700 Y = 9194

12. X = 169901 Y = 20886

13. X = 178322 Y = 94437

14. X = 184739 Y = 1198

15. X = 525257 Y = 29129

X = ? if Y = 1198

X = ? if Y = 8062

X = ? if Y = 31419

As a non-expert, at the moment my best guess at the Xs involves the following process: I divide X by Y for all known values, in the example above this gives results ranging from 0.15 (No.2) to 154.21 (No. 14). I then find the median value across the sample set, in the example above this is 6.13 (represented by No. 10). I then apply this value to Y in the case of unknown X, so in this case I would estimate the 3 unknown Xs to be 7346 and 49420 and 192599.

I have two questions.

a) My simple method gets within +-50% in 60% of cases (the actual data sets are larger than the small example above, with anything from 50 to 500 known X and Ys). Is there a more advanced approach that would give more accurate estimates?

b) If I have to use a formula for estimating 100 unknown X's, and my goal is to get, say, 75% of my estimates to within, say, +-50%. As all values are positive, the most my result can be in error on the negative side is close to -100%, but my method can easily give errors on the positive side in excess of +100%, +200%, even 1,000%. So the scope for positive error is unlimited while the scope for negative error is limited to -100%. Can this fact be harnessed to improve the statistical accuracy and get more results within my target tolerance range?

For practical purposes any suggestions or solutions ideally need to be executable on Excel.

Thank you in advance for your patience with this problem and any help.

Re: Can I improve the accuracy of this method?

Quote:

Originally Posted by

**davidwhite** I hope it is OK to post here - I am a total non-expert with only basic maths ability, but with what I think is a tricky problem to solve. I hope someone here may be able to help.

I need to estimate the value of X (unknown) when I know only the value for Y. To do this I have a list of other occurrences of X and Y, and I have to determining any pattern in the correlation of X to Y. Although X and Y are not totally random, there is no definite or consistent correlation other than there is a tendency for the value of X to be larger if Y is larger. The best that can be hoped for is as accurate a guess as possible. Here is an example:

1. X = 380 Y = 320

2. X = 840 Y = 5500

3. X = 5176 Y = 499

4. X = 5315 Y = 1074

5. X = 6484 Y = 140

6. X = 7244 Y = 5002

7. X = 10950 Y = 1104

8. X = 11040 Y = 279

9. X = 16791 Y = 3001

10. X = 17096 Y = 2788

11. X = 20700 Y = 9194

12. X = 169901 Y = 20886

13. X = 178322 Y = 94437

14. X = 184739 Y = 1198

15. X = 525257 Y = 29129

X = ? if Y = 1198

X = ? if Y = 8062

X = ? if Y = 31419

As a non-expert, at the moment my best guess at the Xs involves the following process: I divide X by Y for all known values, in the example above this gives results ranging from 0.15 (No.2) to 154.21 (No. 14). I then find the median value across the sample set, in the example above this is 6.13 (represented by No. 10). I then apply this value to Y in the case of unknown X, so in this case I would estimate the 3 unknown Xs to be 7346 and 49420 and 192599.

I have two questions.

a) My simple method gets within +-50% in 60% of cases (the actual data sets are larger than the small example above, with anything from 50 to 500 known X and Ys). Is there a more advanced approach that would give more accurate estimates?

b) If I have to use a formula for estimating 100 unknown X's, and my goal is to get, say, 75% of my estimates to within, say, +-50%. As all values are positive, the most my result can be in error on the negative side is close to -100%, but my method can easily give errors on the positive side in excess of +100%, +200%, even 1,000%. So the scope for positive error is unlimited while the scope for negative error is limited to -100%. Can this fact be harnessed to improve the statistical accuracy and get more results within my target tolerance range?

For practical purposes any suggestions or solutions ideally need to be executable on Excel.

Thank you in advance for your patience with this problem and any help.

We would need to know more about the process producing these numbers, and would need to see a lot more data to offer any meaningful opinion.

CB

1 Attachment(s)

Re: Can I improve the accuracy of this method?

The X and Y values are asset types, and I can have data sets where, for example, I have 500 records in a data set. For 200 records both X and Y values may be known, for the remaining 300 records only Y is known. I'm trying to establish the most accurate method of estimating X in these cases.

I know there is no consistent relationship between the two values, but there is a tendency / probability (I use these terms as a layperson) for both to increase in size. I notice that X does not always increase with Y, so I suspect part of the problem may involve quantifying the probability for X and Y to both increase, as well as calculating the most probable proportion of one value to the other. Or maybe the approach required is totally different?

I have attached a larger data set of several hundred records as requested.

Re: Can I improve the accuracy of this method?

Quote:

Originally Posted by

**davidwhite** The X and Y values are asset types, and I can have data sets where, for example, I have 500 records in a data set. For 200 records both X and Y values may be known, for the remaining 300 records only Y is known. I'm trying to establish the most accurate method of estimating X in these cases.

I know there is no consistent relationship between the two values, but there is a tendency / probability (I use these terms as a layperson) for both to increase in size. I notice that X does not always increase with Y, so I suspect part of the problem may involve quantifying the probability for X and Y to both increase, as well as calculating the most probable proportion of one value to the other. Or maybe the approach required is totally different?

I have attached a larger data set of several hundred records as requested.

Look at a log-log plot of this data.

CB

1 Attachment(s)

Re: Can I improve the accuracy of this method?

I've produced a log-log of the example data set where X and Y values are both known. I have attached the log-log graph generated in Excel using base 10 (Excel default, it can be changed). It produces two Y axis, which is no doubt correct. I'm sure this is the approach and I can see that there is a clustering which appears to hint at potential order. But I have no idea how to use this data to produce an Excel formula which I can then apply to future known Y values to estimate unknown X values.

I understand there is a degree of 'figure it out yourself' appropriate for students of maths but I'm not - not even a CSE in maths 30 years ago! If this forum isn't the right place for non-mathematicians to get help can you point me in the right direction? Thanks.

Re: Can I improve the accuracy of this method?

Quote:

Originally Posted by

**davidwhite** I've produced a log-log of the example data set where X and Y values are both known. I have attached the log-log graph generated in Excel using base 10 (Excel default, it can be changed). It produces two Y axis, which is no doubt correct. I'm sure this is the approach and I can see that there is a clustering which appears to hint at potential order. But I have no idea how to use this data to produce an Excel formula which I can then apply to future known Y values to estimate unknown X values.

I understand there is a degree of 'figure it out yourself' appropriate for students of maths but I'm not - not even a CSE in maths 30 years ago! If this forum isn't the right place for non-mathematicians to get help can you point me in the right direction? Thanks.

The problem with a non-educational problem that may be used commercially is the MHF cannot do anything that might result in us being liable in any way for losses financial or otherwise of the advice.

But a linear realationship in a log-log plot implies a relationship of the form:

$\displaystyle Y=k X^{\alpha}$

where $\displaystyle k$ and $\displaystyle \alpha$ are constants.

CB