The very basics of scorecards
Among my career highlights, I count having helped to launch the first-ever credit bureau scores in the Philippines, Thailand, and Malaysia. Despite this, and despite the fact that I’ve been involved in countless other scorecard deliveries, I actually only know enough about their mechanics to be dangerous.
All of which is to say that the ‘very basics’ in the title is purposefully chosen.
And I think that’s all right. Hopefully, I will not offend any practitioners by saying this, but the only thing complicated about scorecards is the mathematics. Their purpose is usually quite simple and their history perhaps surprisingly far-reaching. In the first episode of How to Lend Money to Strangers, I talk to one of the grandmasters of scorecard building, Raymond Anderson, about this history and the enveloping history of risk assessment. If you’re interested in scorecard mechanics beyond the surface, you’d do well to consider one of Raymond’s books on the topic.
Anyway, thus warned, here is my simple primer. A scorecard is an attempt to translate risk, a complicated and multi-faceted concept, into a single easy-to-compare number. I might be young and new to credit but earning a stable income, for example, while you may be established in your career and credit history but in an industry prone to economic peaks and troughs. Both of us have aspects that represent higher than average risk, but to what extent?
If we want to build an effective strategy, we need to know just how much riskier than average. ‘Riskier’ here refers to future behaviour, of course. It is easy to see who is in a worse financial position today, just by looking, but we want to know how two individuals’ situations may vary in the future. That’s harder to do. In fact, it is impossible. But with good data, we can usually get pretty close.
To go about that we start by shifting our perspective a little. Since we can’t know the future, we use the present as a stand-in and try to see if we could have predicted what’s happening today given only what we knew at some specific point in the past.
Let’s assume we’re going to use all accounts opened in June 2020 as our data sample. In a typical project, we might pull all of those accounts’ data from May 2019 to May 2020 (observation period) to represent the known history, as it would have been when the loan was opened. And then we’d use the twelve months from June 2020 to May 2021 as a proxy for the future, for what we don’t know (performance period).
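If you prefer to see that as code, here is a minimal sketch of that sample design in Python. I’m assuming a simple pair of tables, one row per account and one row per borrower per month, and every file and column name is illustrative rather than taken from any real system:

```python
import pandas as pd

# Hypothetical tables; all file and column names are illustrative
accounts = pd.read_csv("accounts.csv", parse_dates=["open_date"])    # one row per account
history = pd.read_csv("monthly_history.csv", parse_dates=["month"])  # one row per borrower per month

# Our data sample: all accounts opened in June 2020
sample = accounts[accounts["open_date"].dt.to_period("M") == pd.Period("2020-06", freq="M")]
hist = history[history["borrower_id"].isin(sample["borrower_id"])]

# Observation period: the known history, as it would have been at opening
observation = hist[hist["month"].between("2019-05-01", "2020-05-31")]

# Performance period: our proxy for the unknowable future
performance = hist[hist["month"].between("2020-06-01", "2021-05-31")]
```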
That’s the ‘when’ answer to the future question. To answer the ‘what’ question, we need a ‘bad definition’. We talk about ‘bad’ because in lending the definition is usually based on a level of delinquency (often whether an account goes more than 90 days past due), but it is really a definition of whatever activity we’re trying to predict. There are even occasions where we might be looking for a positive outcome: in a late-stage collections score, for example, we may target consumers who actually make a payment.
In all cases, we want to pick an outcome that is sufficiently common to create a workable population but also stable enough to minimise noise. So even though an ever 30+ bad definition would capture more bads, many consumers who miss one payment do so for administrative reasons or are otherwise able to cure, so mixing them into the population would only muddy the waters. At least that’s the case for something like a credit card. In a product like a bank overdraft, where consumers who miss one payment invariably miss more, and where missed payments are less common overall, an ever 30+ bad definition might be perfect.
Once we have our bad definition, we also need to choose a good definition. And ‘good’ here is not simply ‘not bad’. Again, what we’re trying to do is avoid noise, so in a model where we label an account as bad if it exceeds 90 days past due, we may only label an account as good if it never exceeds 30 days past due.
The accounts in the middle are ‘indeterminates’. These are accounts that a quick analysis of the roll rates tells us are just as likely to cure as to roll, and so including them in either of our definitions would only add noise to both.
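Continuing the earlier sketch, that three-way labelling might look something like this, with the 90-day and 30-day thresholds taken from the definitions above and the days_past_due column again purely illustrative:

```python
def label_outcome(max_dpd: int) -> str:
    """Classify an account by its worst delinquency in the performance period."""
    if max_dpd > 90:
        return "bad"            # exceeded 90 days past due
    if max_dpd <= 30:
        return "good"           # never exceeded 30 days past due
    return "indeterminate"      # the noisy middle, as likely to cure as to roll

# Worst days past due per borrower over the performance period
worst_dpd = performance.groupby("borrower_id")["days_past_due"].max()
outcomes = worst_dpd.apply(label_outcome)
```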
Now, all we need to do is compare and contrast the two groups of interest: which characteristics do ‘good’ accounts have in common with other ‘good’ accounts, which do ‘bad’ accounts have in common with other ‘bad’ accounts, and which of those characteristics do ‘good’ accounts not also share with ‘bad’ accounts? These will be the predictive characteristics, and this is where the complicated maths comes in. That said, whether the statistician is using traditional regression or more modern machine learning tools, they’re only really comparing and contrasting. And measuring.
Let’s assume that the age of a borrower is shown to be a predictive characteristic: consumers aged 18 to 25 have a higher-than-average tendency to go ‘bad’, those aged 26 to 40 go bad at the rate of the population average, and those aged 40+ are less likely to go bad. The young cohort would get a negative score, the middle cohort would get a neutral score, and the older cohort would get a positive score.
It is not sufficient to identify the predictive characteristics and the direction of their influence; we also want to establish precisely how predictive each of those characteristics is. That complicated maths I mentioned earlier is what makes this possible, but for our purposes, it is enough to know that the degree to which each characteristic influences risk can be measured—because it is essentially the sum of those influences that creates our scorecard.
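To make ‘can be measured’ slightly more concrete: one common yardstick in traditional scorecard building, though by no means the only one, is weight of evidence and its summary statistic, information value, which do exactly that comparing and contrasting, band by band. A minimal sketch with made-up counts:

```python
import numpy as np
import pandas as pd

# Made-up counts of goods and bads per age band, purely for illustration
bands = pd.DataFrame({
    "band":  ["18-25", "26-40", "40+"],
    "goods": [1500, 4000, 2500],
    "bads":  [400, 450, 150],
})

good_dist = bands["goods"] / bands["goods"].sum()  # share of all goods in each band
bad_dist = bands["bads"] / bands["bads"].sum()     # share of all bads in each band

# Weight of evidence: positive where goods are over-represented, negative where bads are
bands["woe"] = np.log(good_dist / bad_dist)

# Information value: a single number summarising how predictive the characteristic is
iv = ((good_dist - bad_dist) * bands["woe"]).sum()
print(bands.round(3), f"\nIV = {iv:.2f}")
```

With these made-up numbers, the 18 to 25 band gets a negative weight, the middle band sits near zero, and the 40+ band gets a positive weight, exactly the pattern in the age example above.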
I am now bumping up against the line that demarcates too much simplification, but in our earlier scenario let us also assume that the number of delinquencies in the previous six months is the most predictive of all the characteristics; that the presence of a delinquency in the previous six months is ten times more important than the fact that someone is aged 18 to 25. We might then get a scorecard that looks something like the one below:
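Expressed as code rather than the usual table, and with only the -10 and -100 points coming from the worked example that follows (every other value is purely illustrative), it might be:

```python
# A toy two-characteristic scorecard: each band of each characteristic
# carries some points, and the final score is simply their sum.
SCORECARD = {
    "age": {
        (18, 25): -10,    # higher-than-average risk
        (26, 40): 0,      # average risk
        (41, 120): 10,    # lower-than-average risk (illustrative value)
    },
    "delinquencies_last_6m": {
        (0, 0): 10,       # clean recent history (illustrative value)
        (1, 99): -100,    # ten times the weight of being aged 18 to 25
    },
}

def score(age: int, delinquencies_last_6m: int) -> int:
    """Sum the points of the band each characteristic value falls into."""
    total = 0
    for value, bands in ((age, SCORECARD["age"]),
                         (delinquencies_last_6m, SCORECARD["delinquencies_last_6m"])):
        for (lo, hi), points in bands.items():
            if lo <= value <= hi:
                total += points
                break
    return total

print(score(age=20, delinquencies_last_6m=1))   # -10 + -100 = -110
```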
Typically, two big questions remain at this point: what about double-counting, and what does a score of some given value mean?
First and simplest, double counting is taken care of by the maths, so it is a good question but one we can ignore: someone who is aged 20 and who has one delinquency in the last six months will get -10 for their age plus -100 for their delinquency, for a total score of -110.
As for the second question, on one level the specific values are largely irrelevant. No two scorecards are built the same, and the range, scaling, and even direction of their scores could be completely different. So 700 on its own is meaningless. What is important is that we have a number system that accurately explains the risk of one subject relative to another. So a score of 700 is meaningless… until it is seen alongside another score of 720.
And here I have to admit that I cheated a little bit earlier. The score you see is a representation of an underlying series of odds, and that relationship is seldom direct. What I mean to say is that a score of 700 will not represent a risk that is ten times lower than a score of 70.
In every population I have ever seen, risk is exponential. That is to say that if we lined everyone up from least risky to most risky, the rate at which risk increases from one person to the next would itself increase as you move down the line. In a group of one hundred, the first in line would be almost imperceptibly less risky than the person behind them, and not all that much less risky than the person in twentieth or even thirtieth spot, whereas the person last in line might be twice as likely to default as a person just ten places in front.
To represent this while still keeping the number of scores under control, we tend to talk about a risk-doubling ratio – something like the odds of default doubling every time the score increases by 10 or 20 points (or halving when it decreases by the same amount). We use odds rather than expected bad rates for this, so the population is said to have three good customers for every bad one, instead of a 25% bad rate. This also means that when we talk about doubling, it is the number of goods per bad that we double, not the percentage bad rate, so it goes from 3:1 (25% bad rate) to 6:1 (14.3%).
And the score is then anchored around a given point. We might, for example, set it such that a score of 700 is where good:bad odds are 10:1, doubling every 20 points from there. In fact, the sketch below works through that scenario for a score that ranges from 600 to 800.
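It assumes nothing beyond that anchor and doubling rate:

```python
ANCHOR_SCORE = 700   # the score we pin down
ANCHOR_ODDS = 10.0   # good:bad odds at the anchor
PDO = 20             # points to double the odds

def score_to_odds(score: float) -> float:
    """Good:bad odds implied by a score, doubling every PDO points."""
    return ANCHOR_ODDS * 2 ** ((score - ANCHOR_SCORE) / PDO)

def bad_rate(odds: float) -> float:
    """Convert good:bad odds into an expected bad rate."""
    return 1 / (1 + odds)

for s in range(600, 801, 20):
    odds = score_to_odds(s)
    print(f"score {s}: odds {odds:7.2f}:1, bad rate {bad_rate(odds):6.1%}")
```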
This ‘trick’ allows us to quickly compare the relative risk of a consumer with a score of 660 to one with a score of 680 and thus, in a slightly roundabout way, to say what a score of 700 means, or indeed a score of any given value.
Note that this approach is not always the one that will be used (sometimes the risk is turned directly into a score), but in my experience, some form of adjustment for the exponential underlying risk relationship is usually required.
I’m dedicating quite a few words to this somewhat niche point, not just because it was a pet peeve of a former boss of mine (hello Ezra), but because it can have some subtle implications for those who use scores for their data analytics.
It can be tempting to compare two groups of customers based on their average score. But consider these two scenarios: in both, we have two customers, in the first scenario with scores of 600 and 800 respectively, and in the second with scores of 680 and 720. In both scenarios, the average score is 700, but the actual risk, whether expressed as good:bad odds or as expected bad rates, is vastly different.
That first portfolio, with its extreme differences, is made up of one consumer with an expected bad rate of 76.2% and one with an expected bad rate of 0.3%, for an average expected bad rate of 38%. Contrast this to the second portfolio, where the average expected bad rate is just over a quarter of that, made up as it is of one consumer with an expected bad rate of 16.7% and one with an expected bad rate of 4.8%.
This is, of course, an exaggerated example, and since most populations congregate around a mean, we can often get away with using the simple approach. But short-cuts aside, it is better to calculate the risk of each consumer in a population, to average that, and then to find the score that most closely aligns with the portfolio risk.
So in our example, because the second portfolio has an average risk of 10.7% (good:bad odds of about 8.4:1) it is better represented by a score of 694. This is not significantly different from the 700 we first used, but it is very much better than the score of 646 which would be the stand-in of choice for the first portfolio.
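On the same scale as before, the whole comparison takes only a few lines; the equivalent scores it produces land within a point or so of the 646 and 694 quoted above, depending on rounding:

```python
import math

def bad_rate(score: float, anchor=700, anchor_odds=10.0, pdo=20) -> float:
    """Expected bad rate implied by a score on the scale described above."""
    odds = anchor_odds * 2 ** ((score - anchor) / pdo)
    return 1 / (1 + odds)

def equivalent_score(rate: float, anchor=700, anchor_odds=10.0, pdo=20) -> float:
    """The single score whose bad rate matches a portfolio's average bad rate."""
    odds = (1 - rate) / rate
    return anchor + pdo * math.log2(odds / anchor_odds)

for portfolio in ([600, 800], [680, 720]):
    avg_rate = sum(bad_rate(s) for s in portfolio) / len(portfolio)
    print(f"{portfolio}: average score {sum(portfolio) / len(portfolio):.0f}, "
          f"average bad rate {avg_rate:.1%}, equivalent score {equivalent_score(avg_rate):.0f}")
```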
But I digress. Suffice to say that the score value must be read in context, because that is what it is for: explaining the risk of one subject relative to any other. This is very useful information, but importantly it is not a decision. The decision to grant credit should be specific to the current situation, weighing factors like the inherent risk of the product sought, the terms of the deal on the table, the would-be borrower’s affordability position, any relevant laws and regulations, any internal policy rules, and of course fraud concerns – all of which are topics I’ll expand upon in later articles and in the show.