Then use the max of either "real" I or "reported" I.
I don't get it. Can you describe it in more detail? Can this satisfy all of these conditions?
I=10,000 E=1000 P=9,000 R=20,000
or
I=10,000 E=9,000 P=1,000 R=2,000
or
I=5,000 E=10,000 P=-5,000 R=2,000
or
I=5,000 E=10,000 P=-5,000 R=-2,000
or
I=5,000 E=5,000 P=0 R=2,000
or
I=-5,000 E=10,000 P=-5,000 R=0 (very rare)
Thanks.
At this point the only thing I can offer is if |(R-P)/I| is greater than 1, report it as 1, and don't worry about its value relative to others that are greater than 1. Any score > 1 requires a significant amount of untruthfulness. Or alternatively - for the denominator use the maximum value of I, E and |R|. For the data you gave previously this would yield:
I=10,000 E=1000 P=9,000 R=20,000: score = 11000/20000 = 55%
or
I=10,000 E=9,000 P=1,000 R=2,000: score = 1000/10000 = 10%
or
I=5,000 E=10,000 P=-5,000 R=2,000: score = 7000/10000 = 70%
or
I=5,000 E=10,000 P=-5,000 R=-2,000: score = 3000/10000 = 30%
or
I=5,000 E=5,000 P=0 R=2,000: score = 2000/5000 = 40%
or
I=-5,000 E=10,000 P=-5,000 R=0 (very rare): How can I be negative? If that is not a typo then score = 5000/10000 = 50%
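The two options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `score_clamped` and `score_max_denominator` are made up here, and the first function assumes I is never zero.

```python
def score_clamped(I, P, R):
    """Option 1: |(R - P) / I|, reported as 1 whenever it exceeds 1.

    Assumes I != 0 (otherwise this divides by zero).
    """
    return min(abs((R - P) / I), 1.0)


def score_max_denominator(I, E, P, R):
    """Option 2: |R - P| divided by the largest of I, E and |R|."""
    return abs(R - P) / max(I, E, abs(R))


# The six (I, E, P, R) cases from the thread, in the same order.
cases = [
    (10_000, 1_000, 9_000, 20_000),
    (10_000, 9_000, 1_000, 2_000),
    (5_000, 10_000, -5_000, 2_000),
    (5_000, 10_000, -5_000, -2_000),
    (5_000, 5_000, 0, 2_000),
    (-5_000, 10_000, -5_000, 0),
]

for I, E, P, R in cases:
    s = score_max_denominator(I, E, P, R)
    print(f"I={I:,} E={E:,} P={P:,} R={R:,}: score = {s:.0%}")
```

Run on the six cases, option 2 gives the same percentages as the list above (55%, 10%, 70%, 30%, 40%, 50%). Option 1 would report the first case as 1, since (20,000 - 9,000)/10,000 = 1.1.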
About your second solution, using the maximum of I, E and |R| as the denominator: in all cases the output is between 0 and 1, but the basis of the criterion changes from case to case. When I is the maximum, income is the base of the calculation; when E is the maximum, E is the base (both are values proposed by the household); and when |R| is the maximum, we use a value that comes from the company's investigation. So it is true that all outputs are between 0 and 1, but the nature of these formulas is different. What do you think?
I think you have an impossible task. Given the lack of data (no I_real nor E_real) you don't have enough data to consistently derive a score between 0 and 1.
You haven't answered one of my previous questions - why do you say that R>=P? Can't you have the case P = +1000 and R = -1000?
No, we can't have the case P = +1000 and R = -1000. We can only have cases like these, for instance:
P=1000, R=2000
P=0, R =2000
P=-1000 R=2000
P=-1000 R=0
P=-1000 R = -500
In the last case the household proposes that it has a loss of $1,000, but our company finds that the loss is smaller (-500), though still not a positive net budget. So in all cases R>=P. As a result, you think (R-P)/I is better in all cases, and its only problem is that it has no upper bound. Is this true?
Thank you again.
Because the proposed value is the baseline for the investigators. They must find out how much money a household is hiding and so increase the company's profit. Unfortunately this is the usual procedure: companies only try to gain more profit. Investigators search for dishonesty and increase the net budget. Decreasing the proposed value is not accepted by the managers or by the firm's policy.
OK, so if they find R<P they make R=P; got it.
Next question - in your view which score should be higher (closer to 1):
(a) P=1, R=2, or
(b) P=10000, R= 20000
or should they be the same, and why?
That's a good and critical question, Ebaines. As I mentioned, I'm creating a system for finding households that are hiding something from us, so I think (b) should be higher in my case. As you said, it is impossible to find a criterion that satisfies all the conditions of my problem, so my first goal is to find one general criterion for all conditions (as I mentioned, your second proposed criterion sometimes uses I, sometimes E and other times abs(R)). I think it is better to have one specific criterion with a solid basis and construction.
The other approach is that your two examples (a and b) have the same importance weight, because (a) has 50% hiding relative to its specific values (1 and 2) and (b) also has 50% hiding relative to its specific values of P=10,000 and R=20,000, taking the household size into account.
Finally, I must say that after designing the formula I will put it into my system and train the system with it. After that we can discuss better criteria further, but for now we can create and improve these formulas.
Thanks.
PS.
Please focus more on situation (b): it should get a higher hiding value because of the larger amount of money being hidden.
If you're willing to live with my above example having the same score, then this could do it:
1. First determine a value of relative difference 'D' between R and P, using D=(R-P)/(|R|), except if R = 0 then set the denominator to 1 (I'm assuming that values for R are typically not single digit). This value for D is always positive or zero, and can range from 0 to infinity.
2. Apply the following transformation: score = D/(D+1). This "compresses" the D value to a range between 0 and 1.
3. You may want to consider a slight change - a "tuning" variable K so that score = KD/(KD+1). You might try K = 0.5, or 0.1 and see how it behaves.
Hope this helps. You need to try a bunch of values and see if it meets your objectives.
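The three steps above can be sketched as follows. This is a minimal illustration, not a finished implementation: the function names are invented here, and it assumes R >= P (as established earlier in the thread) so that D is never negative.

```python
def relative_difference(P, R):
    """Step 1: D = (R - P) / |R|, with the denominator set to 1 when R is 0."""
    denominator = abs(R) if R != 0 else 1
    return (R - P) / denominator


def score(P, R, K=1.0):
    """Steps 2 and 3: compress D into the range [0, 1) using K*D / (K*D + 1).

    Assumes R >= P, so D >= 0 and the denominator K*D + 1 is never zero.
    K = 1 gives the plain D / (D + 1) of step 2.
    """
    D = relative_difference(P, R)
    return K * D / (K * D + 1)


# Examples (a) and (b) have the same D = 0.5, so they get the same score.
print(score(1, 2))            # D = 0.5
print(score(10_000, 20_000))  # D = 0.5

# A smaller K lowers the score for the same D.
print(score(1, 2, K=0.5))

# R = 0 uses a denominator of 1, so D is simply -P.
print(score(-1_000, 0))
```

With K = 1, examples (a) and (b) both come out at 1/3, which is the "same score" behaviour mentioned above. Trying K = 0.5 or K = 0.1 on real data is the quickest way to see whether the spread of scores is useful.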
That is a good approach. First, thank you for your help. I have some questions.
1. For 'D' we have values like 1000 and values like 0.2, but most 'D' values are between 0 and, for example, 20. I think these high values affect 'D' and ultimately the 'Score'. What is your opinion?
2. The lower bound of the score function is 0, but the upper bound is lower than 1. Is this true? D/(D+1) isn't a mapping function onto [0, 1], because, as you know, in a mapping function we find the minimum and maximum of all the 'D's and set them to 0 and 1. Can you describe the score function in more detail?
3. In the second phase, what is the purpose of using 'K'?
4. When I checked the scatter chart of the 'Score' values (phase 2), values near 0.5 have a higher density. What do you think about this? This is the scatter chart:
Attachment 31108
Thanks again and again :-)
Not sure what you mean, but if you're asking whether a high value of D leads to a high score (closer to 1), then yes.
No - the limit of score as D goes to infinity is 1.
It's simply a function that transforms the value of D (which can range from 0 to infinity) to "score" with values from 0 to 1.
Depending on the data for R and P you may find that too many of the values for "score" are too close to 1. The 'K' value serves to lower their values a bit. It may not be needed, but given how the D value is so highly influenced by the denominator |R|, for small values of |R| D can become quite large. This factor can help make the resulting scores a bit smaller.
Please explain what this is a scatter chart of - I assume the vertical axis is "score," but what is the horizontal axis?
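To make the behaviour of the transformation concrete, here is a small sketch that tabulates KD/(KD+1) for a few D values and a few choices of K. The function name `compress` and the sample values are assumptions chosen for illustration (the D values echo the ones mentioned in the questions).

```python
def compress(D, K=1.0):
    """Map D in [0, infinity) to a score in [0, 1) via K*D / (K*D + 1).

    Assumes D >= 0 and K > 0.
    """
    return K * D / (K * D + 1)


sample_D = (0, 0.2, 1, 20, 1000)

for K in (1.0, 0.5, 0.1):
    row = ", ".join(f"D={D}: {compress(D, K):.3f}" for D in sample_D)
    print(f"K={K}: {row}")
```

The score is 0 at D = 0, increases as D increases, and gets closer and closer to 1 without ever reaching it, which is the limit statement above. Unlike a min-max mapping, it does not depend on the smallest or largest D in the data set. The score equals 0.5 exactly when D = 1/K. With K = 1 that means D = 1, which happens for instance whenever P = 0 and R > 0; whether that is what the chart shows depends on the data.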