Daniel R. Kim, MD

Error Metrics for Skewed Classes: Precision, Recall, and F1 Scores

16 Jan 2017

You’re blind, and you have a jar of white and black marbles. You want to get good at retrieving only black, and all of the black, marbles from such jars. Let’s also say you have skewed classes: There are many more white than black marbles ($95\%$ or more of the marbles are white).

How do you judge how good you are at finding black marbles? One way is to look at the accuracy of your classification: what fraction of the marbles did you classify correctly?

The problem is that you could simply classify everything as white, and get an accuracy of greater than $95\%$, without having improved at retrieving black marbles. We need better metrics, but which ones?
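To see the pitfall concretely, here is a minimal Python sketch (the jar and its counts are made up for illustration):

```python
# A hypothetical jar of 100 marbles, 95% white: skewed classes.
labels = ["black"] * 5 + ["white"] * 95

# The do-nothing "classifier": call everything white.
predictions = ["white"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
print(correct / len(labels))  # 0.95 accuracy, yet zero black marbles found
```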

Let’s think about what someone with vision would do. They would:

  1. retrieve all the black marbles, and
  2. leave all the white marbles in the jar.

Notice that you could accomplish (1) without (2) and vice versa. That is, you could retrieve all the black marbles in the jar, but accidentally also retrieve some white marbles. Or, you could walk away with only black marbles, but accidentally leave a bunch of them in the jar.

Recall (or sensitivity) is how well you do (1). It’s the fraction of total black marbles that you retrieve. More generally, it’s the fraction of the relevant instances that were retrieved.

There are a couple of ways to measure (2). You could use the fraction of the total white marbles that stayed in the jar (i.e., specificity). Or, if you’d rather just look at what you’ve retrieved, you could use the fraction of retrieved marbles that are black (i.e., precision). More generally, precision is the fraction of the retrieved instances that are relevant.

(Personally, I think precision is more natural than specificity. I’d rather just look at the marbles in my hand and check that most or all are black. But the benefit of specificity is that if you don’t retrieve any marbles (by classifying them all as white), specificity is equal to one, whereas precision is undefined.)

If metrics of both (1) and (2) are equal to one, then you have a perfect classifier.
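Here is a minimal sketch of these metrics in Python, computed from hypothetical confusion counts (treating black marbles as the positive class); the numbers are illustrative, not from the post:

```python
def recall(tp, fn):
    """Fraction of all black marbles (the positives) that were retrieved."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of retrieved marbles that are black; undefined if nothing is retrieved."""
    return tp / (tp + fp) if (tp + fp) > 0 else None

def specificity(tn, fp):
    """Fraction of white marbles (the negatives) that stayed in the jar."""
    return tn / (tn + fp)

# Hypothetical outcome: retrieve 4 of the 5 black marbles,
# plus 2 white marbles by accident, out of 100 total.
tp, fn, fp, tn = 4, 1, 2, 93
print(recall(tp, fn))       # 0.8
print(precision(tp, fp))    # ~0.667
print(specificity(tn, fp))  # ~0.979
```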

$F_1$ score and harmonic means

Say a disease occurs in 1 of every 100 people. Another disease occurs in 1 of every 2 people. How would you compute the average of such rates?

One way is to take the arithmetic mean of the fractions $1/100$ and $1/2$, which is $0.255 \approx 1/4$. Wait, 1 in 4 people is the average rate? We know intuitively that the answer should be about 1 out of every 50 people (exactly 1 in 51, as computed below). What is wrong with this computation?

The arithmetic mean of the rates considers the distance between $1/2$ and $1/3$ to be much greater than the distance between $1/99$ and $1/100$. However, we want these distances to be the same: in both cases, the number of people needed to see one case changes by exactly one (2 to 3, and 99 to 100). To do that, we need to consider the reciprocals of the rates. While a rate tells you what fraction of a group has a disease, the reciprocal of the rate tells you how many people you need to see one instance of the disease.
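Concretely, averaging the example rates in reciprocal space (100 people per case and 2 people per case) gives

\[\frac{1}{\frac{100 + 2}{2}} = \frac{1}{51},\]

or one case per 51 people, matching the intuitive answer.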

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, and should be used to find the average of rates. The $F_1$ score is the harmonic mean of precision and recall, or

\[F_1 \text{ score} = \frac{1}{\frac{\frac{1}{P} + \frac{1}{R}}{2}} = 2 \frac{PR}{P + R}.\]
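As a quick sanity check, here’s a small Python sketch (with made-up precision and recall values) showing how the harmonic mean penalizes a lopsided classifier far more than the arithmetic mean would:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A classifier that retrieves every marble: perfect recall, terrible precision.
p, r = 0.05, 1.0
print((p + r) / 2)     # arithmetic mean: 0.525, deceptively decent
print(f1_score(p, r))  # harmonic mean: ~0.095, reflects the poor precision
```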