# Thread: Finding criteria for a household financial budget falsification

1. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by ebaines
Not sure what you mean, but if you're asking whether a high value of D leads to a high score (closer to 1), then yes.

No - the limit of score as D goes to infinity is 1.

It's simply a function that transforms the value of D (which can range from 0 to infinity) to "score" with values from 0 to 1.

Depending on the data for R and P you may find that too many of the values for "score" are too close to 1. The 'K' value serves to lower their values a bit. It may not be needed, but given how the D value is so highly influenced by the denominator |R|, for small values of |R| D can become quite large. This factor can help make these large values a bit less.

Please explain what this is a scatter chart of - I assume the vertical axis is "score," but what is the horizontal axis?
Thank you, ebaines. To answer your last question: the vertical axis is the score, as you said, and the horizontal axis is the sample (the different households). You mentioned that we use K when too many of the score values are too close to 1, but why, as you can see in the scatter graph, are so many samples near 0.5 here?

Another question:
You used the absolute value of R in the denominator. That is fine when R is positive, but suppose R is negative, as in case 2 below:

case 1 : P=1000, R=2000
case 2 : P=-1000 R=-500
In case 2 the denominator is 500, which differs from the real value R = -500. The difference from the real value of R is 500-(-500) = 1000. Isn't this a problem for the proposed criterion?

2. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by jacks12
For last question vertical axis is Score as you said and horizontal axis is samples (different households).
I was asking what the values on the x-axis represent. Is it P, or R? Or something else?

Originally Posted by jacks12
here as you can see in scatter graph a lot of samples are near 0.5 !?
It means that many of the households are reporting positive values of P that are about 50% of R, or losses for P that are about 50% greater than R.

Originally Posted by jacks12
you see we have difference of 500-(-500)=1000 here with real value of R. Isn't this a problem for the created criteria?
Not an issue. The intent of the absolute value in the denominator is to make the D calculation always positive. Take these examples :

1. P = 500, R = 1000 yields D = (1000-500)/1000 = 0.5
2. P = -1500, R = -1000 yields D = (-1000-(-1500))/1000 = 0.5

In both cases since the difference (R-P) is the same and R is the same magnitude you get the same value for D, which is what we want. Maybe it would look more appealing if I wrote it as D = |(R-P)/R|, but mathematically it's the same thing since R-P can never be negative.
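
The two examples above can be checked with a short sketch. The function name `discrepancy` is only an illustrative label, not something defined in this thread, and the sketch assumes R is nonzero.

```python
def discrepancy(p: float, r: float) -> float:
    """Relative discrepancy D = (R - P) / |R|; undefined when R == 0."""
    if r == 0:
        raise ValueError("D is undefined when R is 0")
    return (r - p) / abs(r)


print(discrepancy(500, 1000))     # 0.5
print(discrepancy(-1500, -1000))  # 0.5
```

Because the thread assumes R-P is never negative, dividing by |R| rather than R is what keeps D from turning negative when R < 0.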

3. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by ebaines
I was asking what the values on the x-axis represent. Is it P, or R? Or something else?

It means that many of the households are reporting positive values of P that are about 50% of R, or losses for P that are about 50% greater than R.

Not an issue. The intent of the absolute value in the denominator is to make the D calculation always positive. Take these examples :

1. P = 500, R = 1000 yields D = (1000-500)/1000 = 0.5
2. P = -1500, R = -1000 yields D = (-1000-(-1500))/1000 = 0.5

In both cases since the difference (R-P) is the same and R is the same magnitude you get the same value for D, which is what we want. Maybe it would look more appealing if I wrote it as D = |(R-P)/R|, but mathematically it's the same thing since R-P can never be negative.
To answer your question: I checked the criterion on my samples. The x-axis is the household index (1, 2, 3, 4, ...) and the y-axis is the calculated criterion.

4. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by jacks12
To answer your question: I checked the criterion on my samples. The x-axis is the household index (1, 2, 3, 4, ...) and the y-axis is the calculated criterion.
Main problem of your proposed criterion:

case 1 : P = -100 , R = 500 ==> D=1.2
case 2 : P=-100 , R= 2000 ==> D=1.05

In case 2 D should increase, but instead it decreases. How can we fix it?

Under these conditions there is no problem:

case 1 : P = -100 , R = 2000 ==> D=1.05
case 2 : P=-200 , R= 2000 ==> D=1.1

D should increase in case 2, and it does.
Overall, the problem arises when P<0 and R>0.

Thanks.

5. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by jacks12
Main problem of your proposed criterion:

case 1 : P = -100 , R = 500 ==> D=1.2
case 2 : P=-100 , R= 2000 ==> D=1.05

In case 2 D should increase, but instead it decreases. How can we fix it?
Why does it need to be fixed? Seems good to me - the amount of discrepancy as a percentage of R is reduced from case 1 to case 2, so the score is reduced. Remember we agreed that larger discrepancies are OK for larger values of R.

Maybe you'd be happier with a system that simply looks at the difference between R and P without normalizing against anything. In other words:

D = (R-P)
score = D/(D+1)

I think you'll find that you will want to use the factor K we talked about before - given that R and P tend to be numbers in the thousands let K = 0.001 and see if that works:

score = KD/(KD+1)

The result will be that all of these examples yield the same score of 0.5, because all have the same R-P value:

Case 1: P = 1000, R = 2000
Case 2: P= 20000, R= 21000
Case 3: P = -500 R = +500
Case 4: P = -10000, R = -9000
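
As a quick check of the four cases above, here is a minimal sketch of the unnormalized score. The function name `score` and its default argument are illustrative choices; it assumes R >= P, as the thread does, so that D is never negative.

```python
def score(p: float, r: float, k: float = 0.001) -> float:
    """score = K*D / (K*D + 1) with D = R - P (assumes R >= P)."""
    d = r - p
    return k * d / (k * d + 1)


cases = [(1000, 2000), (20000, 21000), (-500, 500), (-10000, -9000)]
for p, r in cases:
    # Every case has R - P = 1000, so K*D = 1 and the score is 0.5.
    print(f"P={p}, R={r}: score={score(p, r):.3f}")
```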

6. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by ebaines
Why does it need to be fixed? Seems good to me - the amount of discrepancy as a percentage of R is reduced from case 1 to case 2, so the score is reduced. Remember we agreed that larger discrepancies are OK for larger values of R.

Maybe you'd be happier with a system that simply looks at the difference between R and P without normalizing against anything. In other words:

D = (R-P)
score = D/(D+1)

I think you'll find that you will want to use the factor K we talked about before - given that R and P tend to be numbers in the thousands let K = 0.001 and see if that works:

score = KD/(KD+1)

The result will be that all of these examples yield the same score of 0.5, because all have the same R-P value:

Case 1: P = 1000, R = 2000
Case 2: P= 20000, R= 21000
Case 3: P = -500 R = +500
Case 4: P = -10000, R = -9000
Compare these two conditions:

case 1 : P = -100 , R = 500 ==> D=1.2
case 2 : P=-100 , R= 2000 ==> D=1.05

and

case 3 : P = 100 , R = 500 ==> D=0.8
case 4 : P= 100 , R= 2000 ==> D=0.95

The amount of discrepancy as a percentage of R is reduced in both conditions, but D decreases in the first condition and increases in the second. Isn't that true?

As for your second proposed criterion (D = R-P), I checked it. All of the score values are 0.9, 0.999 or 1. I tried several different K values but the result is the same. I don't think it is appropriate.

7. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by jacks12
Isn't that true?
No, it's not true. For case 2 the amount of discrepancy as a percentage of R is less than in case 1. Hence D is less.

Looking back at your post #1, were you happy with using a normalizer of max(|P|,|R|), at least for the cases where both P>0 and R>0, or P<0 and R<0? Do you want to keep using those rules? If you were happy with the results in those two cases, then you were happy generating scores approaching 1 any time the lesser of P or R approaches 0. For example:

P = 100, R = 1000 --> score = 0.9
P = 1, R = 1000 --> score = 0.999

Even a small R yields a high score:

P = 1, R = 10 --> score = 0.9

By this reasoning, if P = 0 then the score should equal 1, the highest possible score that you ever want assigned. And if P is negative then the score should be greater than 1, but you don't want any score greater than 1. Consequently I suggest the following:

case a: If P>0 and R> 0 then score = (R-P)/R
case b: if P<0 and R<0 then score = (R-P)/|P|
case c: if P = 0 and R not equal 0 then score = 1
case d: if R = 0 and P not equal 0 then score = 1
case e: if P<0 and R>0 then score = 1

Think about it. You lose all sensitivity to changes in R and P if they are of different signs, but that's a consequence of being consistent with the criteria given for cases a and b and the rule that score must be less than or equal to 1.
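
The case rules above can be written out as a small sketch. The name `piecewise_score` is hypothetical, and the sketch assumes R >= P throughout, as the thread does; the combinations that cases a-e do not mention (P = R = 0, or P > 0 with R < 0) are rejected rather than guessed at.

```python
def piecewise_score(p: float, r: float) -> float:
    """Score following cases a-e from the post (assumes R >= P)."""
    if p > 0 and r > 0:      # case a
        return (r - p) / r
    if p < 0 and r < 0:      # case b
        return (r - p) / abs(p)
    if p == 0 and r != 0:    # case c
        return 1.0
    if r == 0 and p != 0:    # case d
        return 1.0
    if p < 0 and r > 0:      # case e
        return 1.0
    raise ValueError("combination of P and R not covered by cases a-e")


print(piecewise_score(100, 1000))  # case a
print(piecewise_score(-100, 500))  # case e
print(piecewise_score(-100, 2000)) # case e, same score as the line above
```

The last two calls illustrate the loss of sensitivity: once P and R have different signs, the score is 1 regardless of how far apart they are.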

By the way: on that scatter plot, how did you calculate the scores for the 2000 households? From their R and P data? What formulas did you use?

8. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by ebaines
No, it's not true. For case 2 the amount of discrepancy as a percentage of R is less than in case 1. Hence D is less.

Looking back at your post #1, were you happy with using a normalizer of max(|P|,|R|), at least for the cases where both P>0 and R>0, or P<0 and R<0? Do you want to keep using those rules? If you were happy with the results in those two cases, then you were happy generating scores approaching 1 any time the lesser of P or R approaches 0. For example:

P = 100, R = 1000 --> score = 0.9
P = 1, R = 1000 --> score = 0.999

Even a small R yields a high score:

P = 1, R = 10 --> score = 0.9

By this reasoning, if P = 0 then the score should equal 1, the highest possible score that you ever want assigned. And if P is negative then the score should be greater than 1, but you don't want any score greater than 1. Consequently I suggest the following:

case a: If P>0 and R> 0 then score = (R-P)/R
case b: if P<0 and R<0 then score = (R-P)/|P|
case c: if P = 0 and R not equal 0 then score = 1
case d: if R = 0 and P not equal 0 then score = 1
case e: if P<0 and R>0 then score = 1

Think about it. You lose all sensitivity to changes in R and P if they are of different signs, but that's a consequence of being consistent with the criteria given for cases a and b and the rule that score must be less than or equal to 1.

By the way: on that scatter plot, how did you calculate the scores for the 2000 households? From their R and P data? What formulas did you use?
The purpose of this work is that, after creating the criterion, I will apply a threshold and divide all households into two groups: one that is hiding something and one that isn't. So the first goal is a good criterion that satisfies all conditions without switching parameters (for example, sometimes using P as the denominator and sometimes R). The second goal is a criterion that lies between 0 and 1, but finding a criterion that satisfies all conditions is more important than the range (as you said, we can use D/(D+1)).

Yes. For the scatter graph I calculated the proposed formula for 2000 households from their R and P data. First I used D=(R-P)/abs(R) and then Score=D/(D+1), as you said, for every household with K=1.

In these cases :

Case 1 : P = -100 , R = 500 ==> D=1.2
case 2 : P=-100 , R= 2000 ==> D=1.05

In case 1 the household says its net budget is -100$, and we investigate and find a net budget of 500$.
In case 2 the household says its net budget is -100$, and we investigate and find a net budget of 2000$.

Which one do you think is hiding more and must have the higher value of D and score? I think the second case. What is your opinion?

9. ## Re: Finding criteria for a household financial budget falsification

Please go back and reconsider post #35. I really think it will give you what you want. Using D = R-P, score = KD/(KD+1), and setting K = 0.001, for the data you've previously provided we get:

case 1 : P = -100 , R = 500 ==> score = 0.375
case 2 : P=-100 , R= 2000 ==> score = 0.677
case 3 : P = 100 , R = 500 ==> score = 0.286
case 4 : P= 100 , R= 2000 ==>score = 0.655

Note that scores are purely a function of R-P. The bigger that discrepancy, the closer to 1 is the score. There are no issues of what to do with P or R negative versus positive or equal to 0.
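
To make the disagreement over cases 1 and 2 concrete, the following sketch puts the normalized D = (R-P)/|R| next to the unnormalized score for the same four cases. The function names are illustrative only, and K = 0.001 is the value suggested above.

```python
def normalized_d(p: float, r: float) -> float:
    """D = (R - P) / |R|; assumes R != 0."""
    return (r - p) / abs(r)


def unnormalized_score(p: float, r: float, k: float = 0.001) -> float:
    """score = K*D / (K*D + 1) with D = R - P."""
    d = r - p
    return k * d / (k * d + 1)


cases = {
    "case 1": (-100, 500),
    "case 2": (-100, 2000),
    "case 3": (100, 500),
    "case 4": (100, 2000),
}
for name, (p, r) in cases.items():
    print(f"{name}: normalized D = {normalized_d(p, r):.2f}, "
          f"score = {unnormalized_score(p, r):.3f}")
```

Going from case 1 to case 2, the normalized D falls (1.20 to 1.05) while the unnormalized score rises (0.375 to 0.677), since the latter depends only on R-P.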

10. ## Re: Finding criteria for a household financial budget falsification

Originally Posted by ebaines
Please go back and reconsider post #35. I really think it will give you what you want. Using D = R-P, score = KD/(KD+1), and setting K = 0.001, for the data you've previously provided we get:

case 1 : P = -100 , R = 500 ==> score = 0.375
case 2 : P=-100 , R= 2000 ==> score = 0.677
case 3 : P = 100 , R = 500 ==> score = 0.286
case 4 : P= 100 , R= 2000 ==>score = 0.655

Note that scores are purely a function of R-P. The bigger that discrepancy, the closer to 1 is the score. There are no issues of what to do with P or R negative versus positive or equal to 0.
Hi.

E = 1 - 1/(1+|R-P|)

So all values are between 0 and 1. But I have a problem: the R and P values are large (for example 10000000), so |R-P| is large and as a result the E values are all close to 1. My solution was to divide all R and P values by a fixed number such as 100000. After doing this the distribution of E between 0 and 1 is more reasonable. But choosing the proper fixed value is critical: when the fixed value is larger the E values are near 0, and when it is smaller they are near 1.
We eventually want to set a threshold and discriminate between two groups of households. For example, when we set fixValue=10000 a household has an E value of 0.4, and when we set fixValue=1000 it has E=0.7. In the first case this household falls in group A (threshold = 0.5) and in the second case it falls in group B. What can I do about this problem?
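
The sensitivity described above can be illustrated with a short sketch. It assumes E is computed as 1 - 1/(1 + |R-P|/f), where f is the fixed divisor; the names `e_score`, `fix_value` and `cutoff`, and the sample household, are hypothetical. Under that assumption, E > t is equivalent to |R-P| > f*t/(1-t), so changing f only moves the cutoff on |R-P|.

```python
def e_score(p: float, r: float, fix_value: float) -> float:
    """E = 1 - 1/(1 + |R - P| / fix_value); always in [0, 1)."""
    x = abs(r - p) / fix_value
    return 1 - 1 / (1 + x)


def cutoff(fix_value: float, threshold: float = 0.5) -> float:
    """Value of |R - P| at which E crosses the threshold (0 <= threshold < 1)."""
    return fix_value * threshold / (1 - threshold)


p, r = 2000, 7000  # hypothetical household with |R - P| = 5000
for f in (1000, 10000):
    e = e_score(p, r, f)
    print(f"fix_value={f}: E={e:.3f}, above 0.5: {e > 0.5}, "
          f"cutoff on |R-P| = {cutoff(f):.0f}")
```

With a threshold of 0.5 the cutoff equals the fixed value itself, so the same household lands on different sides of the threshold for f = 1000 and f = 10000. The sketch does not say which f is right; it only shows that picking f and picking a cutoff on |R-P| are the same decision.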

Thanks,

11. ## Re: Finding criteria for a household financial budget falsification

Sorry, but I have nothing further to offer. As I noted a long time ago: you have an impossible task here due to lack of data. I still think that the solution I offered back in post #20 does a good job of normalizing for large values of R and P if you have access to reported income or expenses - I suggest you think about that idea again.
