Metrics for Classification Models

Evaluating classification models is quite different from evaluating regression models. Where a regression model can be evaluated with relatively intuitive error-size metrics, the black-and-white nature of a categorical prediction being right or wrong changes the way we have to measure a model's effectiveness.

I find this difference almost ironic. Measuring a model that produces something as seemingly simple as a yes-or-no prediction can involve more nuance than measuring a model that predicts a dependent variable with potentially infinite possible values. The easiest way to explore this is with the visual aid of a confusion matrix, which quantifies how a model has classified or misclassified testing data.

The graphic below is a simple binary confusion matrix. We’ll use it to explore different metrics and why measuring the effectiveness of a binary classification model can be so difficult. 

Fig. 1: Confusion Matrix

Where a prediction of a continuous dependent variable can have a range of passable error, a binary predictor is either right or wrong with no in-between. This dichotomy results in four classes of prediction: true negative, true positive, false negative, and false positive.
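
If it helps to see the four categories in code, here is a minimal sketch using scikit-learn's confusion_matrix; the toy labels below are invented purely for illustration.

```python
# A toy illustration of the four prediction classes (labels are made up).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# For binary labels, scikit-learn returns the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")  # TN=3  FP=1  FN=1  TP=3
```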

A truly perfect model would have no false positives or false negatives, but in practice a model that is more resistant to false negatives is more prone to false positives and vice versa. Is it more important to minimize false positives or false negatives? Are both catastrophic or can we accept one in order to minimize the other? And does prioritizing one form of misclassification harm the model overall?

Relationships between the four categories can be quantified in countless ways, but four of the most basic are called accuracy, recall, precision, and F1. This blog will briefly discuss each of these to explore their usefulness and shortcomings.

Accuracy

Accuracy is defined as the ratio of correct predictions to total predictions: (TP + TN) / (TP + TN + FP + FN)

The accuracy measurement is simply the rate at which a model produces a correct classification. Of the four metrics discussed, accuracy is likely the easiest to gravitate to, whether naively or not. Its problems arise when there is an imbalance between actual positives and actual negatives. For example, if only 1 in 20 instances is an actual positive, then a model that simply classifies everything as negative automatically has 95% accuracy. This is easily misleading, and it is why it is ultimately important to be cautious of relying on a single metric.
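
Here is that 1-in-20 scenario as a quick sketch, assuming scikit-learn is available; the data is hypothetical.

```python
# The 1-in-20 imbalance example: a "model" that predicts negative for everything.
from sklearn.metrics import accuracy_score

y_true = [1] + [0] * 19   # 1 actual positive, 19 actual negatives
y_pred = [0] * 20         # classify every instance as negative

print(accuracy_score(y_true, y_pred))  # 0.95, despite catching zero positives
```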

Recall

Recall (or sensitivity) is defined as the percentage of actual positives that are caught by the model: TP / (TP + FN)

This fills the hole described in the accuracy section, but over-prioritizing sensitivity can result in an increase in false positives. If you are aiming to catch every single positive case, why not classify every case as positive? According to this metric alone, that would be a perfect model.
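
A quick sketch of that loophole, again with made-up labels:

```python
# The "classify everything as positive" loophole: perfect recall, terrible precision.
from sklearn.metrics import precision_score, recall_score

y_true = [1] + [0] * 19
y_pred = [1] * 20          # every case predicted positive

print(recall_score(y_true, y_pred))     # 1.0 -- every actual positive was caught
print(precision_score(y_true, y_pred))  # 0.05 -- 19 of the 20 positive calls are wrong
```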

Precision

In contrast to recall, a model's precision is the percentage of positive predictions that are correct: TP / (TP + FP)

A precise model's positive classifications can be trusted, but that trust may come at the cost of sensitivity. If a model only classifies a case as positive when it is blatantly obvious, it can produce abundant false negatives.
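
A small sketch of that trade-off, with invented labels for illustration:

```python
# An overly cautious model: it only flags the single most obvious case.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]   # one confident positive prediction

print(precision_score(y_true, y_pred))  # 1.0 -- its positive calls can be trusted
print(recall_score(y_true, y_pred))     # 0.25 -- but it missed three of four positives
```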

F1

Because recall and precision can be inversely related, it is helpful to have a higher-level metric that takes both into account. The F1 score represents the harmonic mean of precision and recall.

F1 = (2 × Precision × Recall) / (Precision + Recall)

While F1 scores are slightly less intuitive in what they measure, they still range from 0 to 1 like precision and recall, so they are just as easy to compare. Because the harmonic mean is dragged down by the smaller of its two inputs, a high F1 score is only possible when both precision and recall are high, which makes F1 a valuable metric to prioritize when building models.
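
For completeness, here is the formula above applied to the cautious model from the precision sketch, checked against scikit-learn's f1_score (illustrative numbers only):

```python
# F1 as the harmonic mean, reusing the cautious model from the precision example.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)   # 1.0
r = recall_score(y_true, y_pred)      # 0.25
print(2 * p * r / (p + r))            # 0.4 -- dragged down by the low recall
print(f1_score(y_true, y_pred))       # 0.4 -- same value, computed directly
```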

Final thoughts

The metrics used to evaluate a classification model are not one-size-fits-all. They require a case-by-case evaluation of what is most important to the data being classified. For example, testing for disease may require a different metric prioritization than detecting credit card fraud.

Having an understanding of a few different classification metrics and how they relate to each other is valuable in model creation and tuning. More metrics exist, but the four in this post are a great place to start for most models.
