Is accuracy EVERYTHING? (2024)

If you have been in machine learning for quite some time, you have probably been developing models to attain high accuracy, since accuracy is often treated as the prime metric for comparing models. But what if I told you that model evaluation does not always consider accuracy alone?

When we evaluate a model we do consider accuracy, but what we mainly focus on is how robust the model is, how it will perform on a different dataset, and how much flexibility it has to offer.

Accuracy, no doubt, is an important metric to consider but it does not always give the full picture.

When we say that a model is robust, we mean that it has learned the underlying patterns in the data correctly, so that the predictions it makes are close to the actual values.

Because of the complex mathematical techniques involved and the uncertain nature of data, a model may achieve high accuracy yet still fail to learn the data properly, and so perform poorly when the data varies.
This means the model is not robust enough, which limits its usefulness.

For example, suppose we have 980 apples and 20 oranges, and a model that classifies every fruit as an apple. The accuracy of the model is 980/1000 = 98%, which makes it look highly accurate, but if we use this model on future fruit it will fail miserably: it is broken, because it can only ever predict one class.

You may well have experienced a situation where a model achieves good accuracy on one test set but fails to perform as well when new data of the same nature is provided.

Getting the full picture of the model, i.e., how it interprets the data and how it makes its predictions, helps us understand the model in depth and improve it.

So, suppose you develop a model that attains an accuracy of 80%: how do you plan to improve it?
To correct a mistake, we first have to recognize the mistake.
Similarly, to improve a model we have to look at how it is performing at a deeper level.
This is not achieved by looking at accuracy alone, and hence other metrics are considered.

In this article, I will walk you through some of the most used and important metrics for classification models.

To start with, we will have to look at the confusion matrix.

A confusion matrix is a table used to evaluate the performance of a classification model. It looks like the following:

                  Predicted: ‘no’     Predicted: ‘yes’
Actual: ‘no’            TN                   FP
Actual: ‘yes’           FN                   TP

Here, TN = True Negative, i.e., when the actual value was ‘no’ and the model predicted ‘no’ (a correct prediction).

FP = False Positive, i.e., when the actual value was ‘no’ but the model predicted ‘yes’ (a wrong prediction).

FN = False Negative, i.e., when the actual value was ‘yes’ but the model predicted ‘no’ (a wrong prediction).

And TP = True Positive, i.e., when the actual value was ‘yes’ and the model predicted ‘yes’ (a correct prediction).

If you have studied classification techniques then you are probably already familiar with the confusion matrix; if not, I highly recommend getting familiar with it.

The confusion matrix is all we need to calculate all the metrics for the classification model. YES!!!
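As a concrete illustration (my own sketch, not part of the original article, and assuming scikit-learn is installed), here is how the four cells can be obtained in Python; the labels below are invented for the example.

    # A minimal sketch: build a confusion matrix from made-up labels.
    from sklearn.metrics import confusion_matrix

    y_actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]   # 0 = 'no' (negative), 1 = 'yes' (positive)
    y_predicted = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

    # For binary labels, confusion_matrix returns rows = actual, columns = predicted:
    # [[TN, FP],
    #  [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
    print(tn, fp, fn, tp)   # 5 1 1 3 for the lists above

Every metric discussed below can be computed from these four numbers.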

Let us look at the following metrics:

  • Accuracy: It is calculated using the following:

accuracy = (TN + TP)/(TP + TN + FP + FN)

Accuracy tells us, overall, how often the model makes a correct prediction.

  • Error Rate: The error rate tells us, overall, how often the model makes a wrong prediction.

Classification error = (FP + FN)/(TP + TN + FP + FN)

As is clear from the formulas above,

Error = 1 - Accuracy
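As a small worked sketch (the counts below are invented, not from the article), both quantities can be computed directly from the four confusion-matrix counts:

    # A minimal sketch: accuracy and error rate from hypothetical confusion-matrix counts.
    tn, fp, fn, tp = 60, 4, 4, 60          # invented counts, 128 instances in total
    total = tp + tn + fp + fn

    accuracy = (tp + tn) / total           # 120/128 = 0.9375
    error_rate = (fp + fn) / total         # 8/128   = 0.0625, i.e. 1 - accuracy

    print(accuracy, error_rate)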

  • Sensitivity: Sensitivity (also known as recall) is the fraction of relevant instances that are retrieved out of the total number of relevant instances.
    This means that recall is the ratio of correctly predicted positives to the total number of instances whose actual value is ‘yes’ (positive).
    In other words, recall answers: when the actual value is ‘yes’ (positive), how often is the prediction correct?

It is calculated as:

Sensitivity (recall) = TP / (FN + TP)

Here we consider only the instances whose actual value is ‘yes’ (positive).
As is clear from the expression above, recall is a fraction between 0 and 1, both inclusive.

Recall measures how well our model performs on the instances whose actual values are positive.

If recall = 0, the model is broken: it cannot make even a single correct prediction when the actual value is ‘yes’ (positive).
If recall = 1, the model correctly predicts every instance whose actual value is ‘yes’ (positive), but this can hide problems, which we will discuss shortly.

Now one might ask: why do we need to consider recall when accuracy already provides the general picture?

To answer this, let us consider an example.
Suppose we have a binary classification problem.
Let’s say we have 100 instances in our dataset, 90 of which are 0 (negative) and 10 of which are 1 (positive).
Now assume that, because of the class imbalance, our model predicts all 100 entries as 0 (negative).
The accuracy is 90/100 = 90%.
Looking only at the accuracy, one could say the model is highly accurate. But let us now check the recall.

Recall = 0/10 = 0
We get recall = 0, which means the model is broken: it cannot correctly classify even a single entry whose actual value is positive.
This shows how important recall is when judging model performance.
It also shows that accuracy alone is not the best way to evaluate a model.
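Here is the same example as a short sketch (my own, assuming scikit-learn; the data is exactly the 90/10 split described above):

    # A minimal sketch of the imbalanced example: 90 negatives, 10 positives,
    # and a model that predicts negative for everything.
    from sklearn.metrics import accuracy_score, recall_score

    y_actual = [0] * 90 + [1] * 10
    y_predicted = [0] * 100

    print(accuracy_score(y_actual, y_predicted))   # 0.9 -> looks "highly accurate"
    print(recall_score(y_actual, y_predicted))     # 0.0 -> not a single positive is found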

So we do not want a very low recall, but what if our recall is very high (very close or equal to 1)?
If recall = 1, the model has correctly classified every instance whose actual value is positive. But this could also happen because the model predicts every value as positive, which would mean that the model has low precision.
Before going any further, let us first define what precision is.

  • Precision: Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.

In other words: when the model predicts a positive value, what are the odds that the prediction is correct?

It is calculated as follows:

Precision = TP / (TP + FP)

This means we want to maximize the precision of our model so that the positives it predicts really are positives. But if we push precision very close to 1, the model will typically have a very low recall, because it becomes conservative about predicting positives and we end up with a large number of false negatives.

So, we have a precision-recall trade-off: we would like to maximize both, but improving one typically lowers the other.
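To make the trade-off concrete, here is a small sketch (my own, assuming scikit-learn; the labels are invented): a model that predicts ‘positive’ only when it is extremely sure gets perfect precision but poor recall.

    # A minimal sketch of the precision-recall trade-off with invented labels.
    from sklearn.metrics import precision_score, recall_score

    y_actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # a very cautious model: one positive prediction

    print(precision_score(y_actual, y_predicted))  # 1.0  -> every predicted positive is correct
    print(recall_score(y_actual, y_predicted))     # 0.25 -> but most actual positives are missed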

A great explanation I read:

While recall expresses the ability to find all relevant instances in a dataset, precision expresses the proportion of the data points our model says was relevant actually were relevant.

If you have been in this field for a while, you will know that which of the two to prioritize is situational.
We often encounter problems where we need to maximize either recall or precision, largely irrespective of the other. For example, in medical screening a missed positive (false negative) is usually far more costly than a false alarm, so recall matters more; in spam filtering, flagging legitimate mail is costly, so precision matters more.
I will not go into further detail, since that is not this blog’s objective.

To get a single ‘combined’ measure of precision and recall, we consider another metric called the F1-score.

  • F1-Score: F1-score is a metric that combines recall and precision by taking their harmonic mean.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

We use the harmonic mean instead of a simple average because it punishes extreme values. A classifier with, say, a precision of 1.0 and a recall of 0.0 would have a simple average of 0.5 but an F1 score of 0. The F1 score gives equal weight to both measures and is a specific case of the general Fβ metric, where β can be adjusted to give more weight to either recall or precision.
So, if we want a balanced classification model with good values of both recall and precision, then we try to maximize the F1-score.
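A tiny sketch (numbers invented, my own illustration) of why the harmonic mean behaves this way:

    # A minimal sketch: harmonic mean (F1) versus simple average for extreme values.
    def f1(precision, recall):
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f1(0.5, 0.5))   # 0.5 -> same as the simple average when the two are balanced
    print(f1(1.0, 0.0))   # 0.0 -> simple average would be 0.5, F1 collapses to 0
    print(f1(0.9, 0.1))   # roughly 0.18 -> simple average would be 0.5, F1 is much lower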

The F1-score is a specific case of the Fβ metric; let us look at the general case.

  • Fβ-score: This metric also combines recall and precision, but it does so by giving different weights to each of them.
    It allows the data scientist to put more weight on either recall or precision.
    It is calculated as:
Fβ-score = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)

The most commonly used beta values are β = 0.5, 1, and 2.
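As a sketch (my own, assuming scikit-learn; labels invented), the same predictions scored with different β values show how the weighting shifts between precision and recall:

    # A minimal sketch: Fβ with beta < 1 favours precision, beta > 1 favours recall.
    from sklearn.metrics import fbeta_score

    y_actual    = [1, 1, 1, 1, 0, 0, 0, 0]
    y_predicted = [1, 1, 0, 0, 1, 0, 0, 0]

    print(fbeta_score(y_actual, y_predicted, beta=0.5))  # weights precision more
    print(fbeta_score(y_actual, y_predicted, beta=1.0))  # identical to the F1-score
    print(fbeta_score(y_actual, y_predicted, beta=2.0))  # weights recall more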

  • Specificity: It is also known as the True Negative Rate.
    It measures the proportion of actual negatives that are correctly identified as such.

It is calculated using the following:

specificity = TN/(TN + FP)

Specificity tells us how good our model is at correctly identifying the negatives.

  • False Positive Rate (FPR): The false-positive rate is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events.

It is calculated using the following:

FPR = FP/(FP + TN)

So, FPR tells us how often the model is wrong when the actual value is negative.
In other words, if we look only at the negatives, FPR is the proportion of them that the model has wrongly classified as positive.
That is, it measures how often the model raises a false alarm.
As is clear from the two formulas above, FPR = 1 - Specificity.
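A small sketch (counts invented, my own illustration) showing both quantities derived directly from the confusion-matrix counts, and how they are complementary:

    # A minimal sketch: specificity and FPR from hypothetical confusion-matrix counts.
    tn, fp, fn, tp = 75, 25, 5, 95

    specificity = tn / (tn + fp)   # 0.75 -> 75% of actual negatives identified correctly
    fpr = fp / (fp + tn)           # 0.25 -> a false alarm on 25% of the actual negatives

    print(specificity, fpr)        # FPR = 1 - specificity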

Matthews Correlation Coefficient (MCC)

So far we have looked at metrics that cannot, on their own, describe the robustness of a model. But what if I told you there is a metric that can pretty much do the job ‘alone’?

Many scientists believe that MCC is the single most informative metric for any binary classifier.

MCC (just like the other metrics) is calculated from TN, FN, TP, and FP.
MCC is not a biased metric and is flexible enough to work well even on highly imbalanced data.

MCC returns a value between -1 and +1.
A coefficient of +1 represents a perfect prediction,
A coefficient of 0 represents no better than a random prediction,
A coefficient of −1 indicates total disagreement between prediction and observation.

It is calculated as:

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
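As a final sketch (my own, assuming scikit-learn), here is MCC on the imbalanced example from earlier, where the all-negative model scored 90% accuracy:

    # A minimal sketch: MCC on the 90/10 example with an all-negative model.
    from sklearn.metrics import matthews_corrcoef

    y_actual = [0] * 90 + [1] * 10
    y_predicted = [0] * 100

    print(matthews_corrcoef(y_actual, y_predicted))   # 0.0 -> no better than a random prediction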

Conclusion

So, we have discussed quite a few metrics for evaluating a model, and we have seen that accuracy is not the only metric one should consider when judging it.

I hope this article was useful to you, and if you have any queries or suggestions please write in the comment section.

THANK YOU!!
