# Thread: Probability of Loan Repayments (Exercise)

1. ## Probability of Loan Repayments (Exercise)

Hi All,

Have a bit of a head scratcher on my hands. Been given an exercise in college where we have to create a system which highlights high/low risk loans based on a database given to us. The database provides information on 1000 previous loans, the details of those loans and whether or not there was a problem with repayment. I've used pivot tables to break down the data to see which were the problematic loans, and it appears like this:

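| Factor | Group | Repayment problems |
| --- | --- | --- |
| Loan type | Car | 3% |
| Loan type | Repairs | 2.7% |
| Loan type | Consolidation | 10.2% |
| Loan type | Holiday | 50% |
| Loan type | Other | 3.1% |
| Region | A | 3.3% |
| Region | B | 14% |
| Region | C | 8.3% |
| Region | D | 4.4% |
| Payment method | Direct debit | 0.2% |
| Payment method | No direct debit | 13% |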

The percentage beside each indicates the percentage of loans that had repayment problems.

Intuitively, I can see that the highest risk loan would be a holiday loan, from someone living in Region B, choosing not to pay with direct debit. The problem I'm having is that I need the mathematics to support it! I was thinking of combining the probabilities of all of them; for example, in this case it would be 0.5 + 0.14 + 0.13, giving 77%, but that seems too straightforward, and I'm not confident! If anyone has the time and knowledge to help me with this, it would be greatly appreciated, and if I need to provide any further info that I've left out, feel free to ask.

Cheers,

Conor

2. ## Re: Probability of Loan Repayments (Exercise)

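If I understand what you have done, you cannot properly add those probabilities. Sums of that kind can exceed 100% (three rates of 50% would add up to 150%), which is not a valid probability. Without knowing how much probability theory you know, it is hard to give an answer that you will be certain to understand.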

Based on your data, what is the total number of loans made that were holiday loans from Region B and were not paid by direct debit? How many of those loans had repayment problems? If you like to think in terms of tables, you need to be thinking in terms of two three-dimensional tables, one showing total loans in each category and the other showing problem loans in each category. As a theoretical matter, there will be a difficulty if any category has a small number of total loans. You may be expected to ignore this difficulty so long as the number of total loans in a category is positive, but you cannot ignore it if the number of loans is zero. What techniques have you been given to evaluate sample sizes?
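The two tables described above can be sketched in a few lines of Python. This is only an illustration: the `loans` records below are made up, and the field order (loan type, region, direct debit, had problem) is an assumption about how the supplied database might be laid out.

```python
from collections import Counter

# Hypothetical records: (loan_type, region, direct_debit, had_problem).
# The real exercise would read the 1000 rows from the supplied database.
loans = [
    ("Holiday", "B", False, True),
    ("Holiday", "B", False, True),
    ("Holiday", "A", True, False),
    ("Car", "B", False, False),
    ("Car", "B", False, True),
    ("Car", "C", True, False),
]

total = Counter()     # all loans in each (type, region, direct_debit) cell
problems = Counter()  # loans with repayment problems in each cell

for loan_type, region, direct_debit, had_problem in loans:
    cell = (loan_type, region, direct_debit)
    total[cell] += 1
    if had_problem:
        problems[cell] += 1

# Print both counts side by side for every cell that has at least one loan.
for cell in sorted(total):
    print(cell, "total:", total[cell], "problems:", problems[cell])
```

Each key plays the role of one cell in the three-dimensional table; a cell that never appears in `total` is one with zero loans.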

3. ## Re: Probability of Loan Repayments (Exercise)

Hi Jeff, and thanks for your reply.

There are 9 loans in total matching the criteria you asked about, all of which had problems. Does that mean that, based on the data I was given, the probability of a problem for a 10th loan application matching those criteria would be 100%? That doesn't seem correct; in my mind it would have been necessary to calculate the individual probabilities and then create combined percentages.

4. ## Re: Probability of Loan Repayments (Exercise)

The problem here is that you are working with a very small sample, just nine loans. When you break down a loan portfolio of 1000 loans into 40 subsets, your average sample will be 25, and probably many samples will be far below that average. With small samples, the results may be very misleading.

You are mixing up the terms individual and marginal probabilities. The OBSERVED frequency for problems on a holiday loan is 50%. That is a weighted average across the eight combinations of region and payment method for holiday loans. It is not an individual observation; it is a marginal one. The OBSERVED frequency for problems on a holiday loan from Region B without automatic payment is 100%. That is your best estimate of that individual probability. Now, as I say, there is a reasonable chance that it is a bad estimate. Nevertheless, if I were credit officer for that bank, I'd say stop making such loans because the cost (in terms of loans written off) of getting a reliable sample will be very high.
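One rough way to see how uncertain a 9-out-of-9 estimate is: assume (this is an assumption, not something the data tells you) that each loan in the category independently has the same problem probability p. Then the chance that all n loans have problems is p^n, and the smallest p for which that outcome still has probability at least alpha is alpha^(1/n). The function name and the 5% threshold below are illustrative choices, not part of the exercise.

```python
def lower_bound_all_problems(n, alpha=0.05):
    """Smallest problem rate p for which seeing n problems in n loans
    still has probability at least alpha, i.e. p**n >= alpha.
    Assumes independent loans with a common problem rate."""
    if n <= 0:
        raise ValueError("need at least one observed loan")
    return alpha ** (1.0 / n)

# Compare the 9-loan sample with larger hypothetical sample sizes.
for n in (9, 25, 100):
    print(n, round(lower_bound_all_problems(n), 3))
```

Under those assumptions, nine problem loans out of nine are still compatible with a true rate of roughly 72%, which is why 100% should be read as an estimate rather than a certainty.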

You still have not told me how much you know about probability theory.

$P(B) \ne 0 \implies P(A \mid B) \equiv \dfrac{P(A \text{ and } B)}{P(B)}.$ That is the definition of conditional probability.

When we know P(B) and P(A | B), we say $P(A \text{ and } B) = P(A \mid B) \cdot P(B).$

But in this case, how do we find P(A | B) in the first place? You have to look at P(A and B) and P(B).
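As a worked sketch using the counts reported earlier in the thread (9 such loans out of 1000, all 9 with problems), take A to be "the loan had repayment problems" and B to be "holiday loan, Region B, no direct debit". These labels for A and B are my choice for illustration.

$$P(B) = \frac{9}{1000}, \qquad P(A \text{ and } B) = \frac{9}{1000}, \qquad P(A \mid B) = \frac{9/1000}{9/1000} = 1.$$

The 1000 cancels, so the conditional probability is simply problem loans in the category divided by total loans in the category.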

Let me try putting it a different way. You have observed frequencies $F_1, \ldots, F_{40}.$ Those are the raw data. Everything else is built from them. I have no idea what course you are taking or the context of the problem. Consequently, I cannot tell you what to do here. All I can tell you is that I would have instantly fired any credit officer who suggested we make holiday loans in Region B without automatic payment. You have 9 loans sampled, and all 9 had problems. Does it really make any practical difference whether the true rate of problems for that class of loans is 90% or 100%?

5. ## Re: Probability of Loan Repayments (Exercise)

The aim of the exercise is to calculate the risk of accepting future loan applications based on the observed frequency of the first 1000 loans. So, in my mind, we're better off treating it as a probability exercise and ignoring the issue of sample size. To be honest, I don't have a great grasp of probability.

6. ## Re: Probability of Loan Repayments (Exercise)

In that case, you need to compute the observed frequency of total loans in each of the forty categories you are using for analysis, the observed frequency of problem loans in the same categories, and the ratio of problem loans to total loans in each category. That ratio is the relative frequency of problems that has ACTUALLY BEEN OBSERVED. You can do this easily in Excel using four simple tables. This approach is a bit crude but is at least reasonable UNLESS you have one or more categories with 0 total loans. In the latter case, you need to let us know because this approach is then too crude to work.
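For anyone who would rather script this than build it in a spreadsheet, here is a minimal Python sketch of the same ratio calculation, including the check for empty categories. The category labels follow the first post; the counts are placeholders, apart from the 9-out-of-9 holiday cell reported in the thread, and the variable names are my own.

```python
from itertools import product

loan_types = ["Car", "Repairs", "Consolidation", "Holiday", "Other"]
regions = ["A", "B", "C", "D"]
payment = ["Direct debit", "No direct debit"]

# Counts per cell. Only the Holiday/B cell comes from the thread;
# the Car/A figures are hypothetical, and unlisted cells have no loans.
total = {
    ("Holiday", "B", "No direct debit"): 9,
    ("Car", "A", "Direct debit"): 40,
}
problems = {
    ("Holiday", "B", "No direct debit"): 9,
    ("Car", "A", "Direct debit"): 1,
}

rates = {}        # observed problem rate for each non-empty cell
empty_cells = []  # cells with zero loans, where no ratio can be formed

for cell in product(loan_types, regions, payment):
    n = total.get(cell, 0)
    if n == 0:
        empty_cells.append(cell)
        continue
    rates[cell] = problems.get(cell, 0) / n

print("cells with a rate:", len(rates))
print("cells with no loans:", len(empty_cells))
```

Any cell that lands in `empty_cells` is exactly the case flagged above, where the simple ratio cannot be computed.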

7. ## Re: Probability of Loan Repayments (Exercise)

I checked in to see if you had responded. You actually need six tables, not four.