A Simple Adjustment Improves Out-of-Distribution Detection for Any Classifier

Anyone who has tried training ML models on real-world datasets (not the perfectly curated data we work with in school) has probably dealt with outliers in data. The problem with some Out-of-Distribution (OOD) detection algorithms is they make a big assumption — that a model is equally confident in all classes — most of the time, that assumption is false. For example, a model trained on ImageNet is typically overconfident (predicted probabilities close to 1) for bananas but underconfident (low predicted probabilities close to 0) for the ten different but very similar-looking lizard classes in the dataset.

In this article, we’ll show you a novel and simple adjustment to model predicted probabilities that can improve OOD detection with classifier models trained on real-world data. Our unique approach is rooted in theory and runs in just a couple of lines of code.

Background

Identifying outliers in test data that do not stem from the distribution of the training data is critical for deploying reliable machine learning models. While many special (e.g., generative) models have been proposed for this task of Out-of-Distribution (a.k.a. anomaly/novelty) detection¹²³, these are often specific to particular data types and nontrivial to implement. Instead, simpler methods for OOD detection that use an already-trained classifier on data with class labels have become quite popular⁴⁵. Methods like KNN distance⁶⁷ and Mahalanobis Distance³⁴ leverage a trained neural network’s intermediate feature representations to identify OOD examples.

An even simpler approach is to only use the predicted class probabilities output by the trained classifier and quantify their uncertainty as a measure of outlyingness. Two particularly popular OOD methods are Maximum Softmax Probability (MSP)⁵ or Entropy⁴⁶. Compared to most other methods, MSP and Entropy need less information from the model and require less computing to identify outliers. Here we introduce a simple improvement to these baseline methods to improve their effectiveness.

Baseline prediction-based OOD detection methods

Consider an image $x$ and classifier model $p = h(x)$ where $p$ is the model’s predicted probability vector that this image belongs to each class $k \in {1,…,K}$ . Based on $p$ , one can compute two simple OOD scores for $x$ .

Maximum Softmax Probability (MSP) — quantifies how confident the model is in the most likely class it predicts:

\mathrm{MSP}(x) = \argmax_{1 \le k \le K} \ \{p_k\}

Entropy — quantifies how evenly spread the model’s probabilistic predictions are amongst all $K$ classes:

\mathrm{Entropy}(x) = -\sum_{k=1}^K p_k \text{log}(p_k)

These scores have been shown to work surprisingly well for detecting OOD images⁵, despite the fact that they do not explicitly estimate epistemic uncertainty⁴.

Simple adjustment to Improve Baseline Methods

The model predicted probabilities $p$ are subject to estimation error. Trained models can have a biased propensity to predict specific classes over others, particularly when the classes in the original dataset are imbalanced. To account for these issues, we adjust the predicted probabilities using class confident thresholds¹⁴, forming new OOD scores based on the MSP/Entropy of the resulting adjusted predictions.

Calculating Class Confident Thresholds

Letting $y_i$ denote the class label for the i-th example in our training data and $p_{i,k}$ denote the probability that this example $x_i$ belongs to class $k$ according to our model, we compute a confident threshold vector $c$ whose k-th element is defined as:

c_k = \frac{1}{N_k} \sum_{i=1}^{N} p_{i,k} \cdot \mathbb{\mathbf{1}}{[y_i = k]}

The confidence thresholds are the average probability of a class predicted by our model amongst the examples labeled as that class. This vector thus represents our model’s propensity to predict a particular class for examples labeled as that class and has been proven to be a natural threshold for determining the reliability of probabilistic predictions¹⁴.

Adjusting Model Predicted Probabilities for Noise

For any new example $x$ , its predicted probability vector $p = h(x)$ is subsequently adjusted by the class confident threshold as follows:

\tilde{p} = \frac{1}{Z} \big[ p - c + \bar{c} \big]

Here scalar $\bar{c}$ is the largest value in the confident threshold vector (to ensure nonnegative probabilities) and scalar $Z$ is a normalizing constant (to ensure the probabilities sum to one over the classes):

\bar{c} = \max_{1\le k\le K} \ \{c_k \}\text{, } Z = \sum_{k=1}^K \big[ p_k - c_k + \bar{c} \big]

While the confident thresholds vector $c$ is always calculated using the train-predicted probabilities and labels. Any model output predicted probabilities (i.e., for additional test data) could be adjusted using these thresholds.

Computing adjusted OOD scores

Improved OOD scores for $x$ are achieved simply by plugging in the adjusted predicted probabilities $\tilde{p}$ in the place of $p$ into either of the respective MSP/Entropy formulas. This adjusted OOD detection procedure thus remains extremely simple and is easy to implement in practical deployments.

Evaluating OOD Detection Performance

Following standard OOD benchmarking procedures, existing image classification datasets are grouped in pairs where: one dataset is used to train a Swin Transformer⁸ classifier and considered in-distribution training data, while examples from the second dataset are mixed in with the testing data of the first dataset (at a 50–50 ratio) as out-of-distribution images. Each OOD scoring method is applied to all images in the test set (without knowledge of their source or their labels) to produce a ranking of these images, which we evaluate using the AUROC for how well these scores detect OOD examples.

We consider 2 different OOD detection problems based on popular image classification datasets: CIFAR-10⁶ vs. CIFAR-100⁶ and MNIST⁵ vs. FASHION-MNIST⁷. Our first benchmark relies on the original versions of these datasets, where classes naturally occur in equal proportions.

We also run a second benchmark in which we introduce class imbalance in each training set. Here we create new imbalanced training sets for CIFAR-10, MNIST, and FASHION-MNIST, where in each training set: 6 classes each contain 2% of the total examples and 4 classes each contain 22% of the examples. We also create an imbalanced training set for CIFAR-100 in which 90 classes each have 0.63% of the examples, and 10 classes each have 4.25% of the examples. This allows us to evaluate how well our OOD scores perform in settings where the classes occur in unequal proportions in the labeled training data, as is often the case in real-world applications.

Benchmark Results

Tables 1 and 2 list the AUROC performance achieved by both adjusted and original (non-adjusted) OOD scoring methods for each benchmark setting. For many in-distribution / OOD dataset pairs, there is a clear improvement that results from our proposed adjustment.

Table 1: Performance (AUROC) of Out-of-Distribution detection with original (balanced) datasets (higher is better).

Table 2: Performance (AUROC) of Out-of-Distribution detection with imbalanced datasets (higher is better).

With only a minor adjustment to the predicted probabilities output by a trained classifier, the performance of both Entropy and MSP-based out-of-distribution detection scores is increased.

Reproduce these benchmarks yourself here: ood-detection-benchmarks.

Try improved OOD Detection yourself in 3 lines of code

Implementing this adjusted OOD scoring can easily be done in practice with the cleanlab library. If you already have data and a trained model you can do:

python

0
1
2
3
4
5
from cleanlab.outlier import OutOfDistribution
ood = OutOfDistribution()ood.fit(pred_probs=train_pred_probs, labels=train_labels)ood_scores = ood.score(pred_probs=test_pred_probs)

Scores in cleanlab are normalized to lie in [0,1] with values near 0 indicating examples more likely to be out-of-distribution.

Resources to learn more and run OOD detection on your own data

5 min Tutorial to run OOD detection on your own data
Example of advanced OOD detection code
Code to reproduce the benchmarks presented here
cleanlab open-source library
Cleanlab Studio: no-code data improvement

Join our community of scientists/engineers to ask questions and help build the future of open-source Data-Centric AI: Cleanlab Slack Community

References

Yang, J., Zhou, K., Li, Y., and Liu, Z. Generalized out-of-distribution detection: A survey. arXiv:2110.11334. 2021. ↩ ↩² ↩³
Ran, X., Xu, M., Mei, L., Xu Q., and Liu Q. Detecting out-of-distribution samples via variational auto-encoder with reliable uncertainty estimation. Neural Networks. 2022. ↩
Cao, S., and Zhang, Z. Deep Hybrid Models for Out-of-Distribution Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. ↩ ↩²
Kirsch, A., Mukhoti, J., van Amersfoort, J., Torr, P. H. S., and Gal, Y. On pitfalls in OOD detection: Entropy considered harmful. ICML Workshop on Uncertainty and Robustness in Deep Learning. 2021. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶
Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017. ↩ ↩² ↩³ ↩⁴
Kuan, J., and Mueller, J. Back to the Basics: Revisiting Out-of-Distribution Detection Baselines. ICML Workshop on Principles of Distribution Shift. 2022 ↩ ↩² ↩³ ↩⁴
Angiulli, F. and Pizzuti, C. Fast outlier detection in high dimensional spaces. In European conference on principles of data mining and knowledge discovery, 2002. ↩ ↩²
Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018. ↩

Browse all Next

Introducing new data quality algorithms for multi-label classification in cleanlab v2.2

ActiveLab: Active Learning with Data Re-Labeling

ActiveLab helps you optimally choose which data to (re)label, lowering the cost to train an accurate ML model.

Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?

A comprehensive benchmark of evaluation models to automatically catch incorrect responses across five RAG applications.

Get started today

TLM is free to try and adds a reliabilty layer to RAG and GenAI systems in a few lines of code.

Try for free Contact sales

More resources

Explore applications of Cleanlab via blogs, tutorials, videos, and read the research that powers this next-generation platform.

Join us on Slack

Join the Cleanlab Community to ask questions and see how scientists and engineers are practicing Data-Centric AI.