How we built Cleanlab Vizzy

Cleanlab Vizzy is an interactive visualization of confident learning — a data-centric AI family of theory and algorithms for automatically identifying and correcting label errors in your datasets. Confident learning enables you to train a model on bad labels that’s similar in performance to the model you would have had if you had trained on error-free data.

We built our company using confident learning and wanted to illustrate one of the primary algorithms of the field for a wider audience – we’ll refer to this as the “Cleanlab algorithm”. This post first introduces the intuition and theory behind the algorithm and then delves into how we built an in-browser visualization of it finding errors in a small image dataset.

Background

Briefly, the Cleanlab algorithm involves:

(1) Generating out-of-sample predicted probabilities for all datapoints in a dataset: We train on a dataset’s given labels with 5-fold cross-validation to obtain these probabilities.

(2) Computing percentile thresholds for each label class based on the predicted probabilities: Comparing predicted probabilities against per-class percentile thresholds can be informative. By percentile threshold, we mean that for some label class, we aggregate the corresponding predicted probabilities for that class for all datapoints that have that given label. Using these probabilities, we can compute values at each percentile (e.g. the median value is at the 50th percentile) and use these as thresholds.

Intuitively, these thresholds can be used to distinguish which datapoints are more or less likely to have a given label:

For example, datapoints with class probabilities exceeding a high class percentile threshold are more likely to have the corresponding class label.
Conversely, datapoints with class probabilities below a low class percentile threshold are unlikely to have the corresponding class label.

For a specific datapoint, if its predicted probability is below this low percentile threshold for each class, it may be the case that the datapoint belongs to none of the classes, i.e. it is out of distribution.
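As a rough illustration of step (2), here is a minimal Python sketch of per-class percentile thresholds. The helper names (`percentile`, `class_percentile_thresholds`), the linear-interpolation rule, and the toy probabilities are assumptions made for this example. They are not taken from the visualization's source code.

```python
from collections import defaultdict

CLASSES = ["dog", "cat", "bear"]  # column order of each probability row


def percentile(values, q):
    """Return the q-th percentile (0-100) of values, using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def class_percentile_thresholds(pred_probs, given_labels, q):
    """For each class, collect P(class) over the datapoints that were GIVEN
    that class as their label, then take the q-th percentile of those values."""
    per_class = defaultdict(list)
    for probs, label in zip(pred_probs, given_labels):
        per_class[label].append(probs[CLASSES.index(label)])
    return {c: percentile(per_class[c], q) for c in CLASSES if per_class[c]}


# Toy out-of-sample predicted probabilities (hypothetical values).
pred_probs = [
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.20, 0.60, 0.20],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
]
given_labels = ["dog", "dog", "dog", "cat", "cat", "bear"]

median_thresholds = class_percentile_thresholds(pred_probs, given_labels, 50)
print(median_thresholds)
```

Passing a high `q` gives a stricter per-class threshold, and passing a low `q` gives the kind of low threshold used for spotting out-of-distribution datapoints.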

That’s all you need to know to understand the rest of this post.

I want to know more about the theory!

Consider an image dataset with millions of images labeled cat, dog, or bear. Unfortunately, like most datasets, some of these given labels may be incorrect: for example, the dataset contains an image of a cat but the given label is bear, and another image of a dog but the given label is cat. Because there are millions of images, we can’t check them all by hand, so we need an automated algorithm like cleanlab.

Each image has some true label $y_{true}$ , but the given label $y$ in the dataset may not match the true label, i.e. $y \neq y_{true}$ . For example, an image of a Golden Retriever has a true label dog, but may have been given the label cat due to human error or other factors. The Cleanlab algorithm (aka the Confident Learning algorithm) is a framework for identifying and correcting these label errors.

How does it work? Consider an approach where we train an image classifier on a dataset. For every image, the model outputs a predicted label $y_{pred}$ and a set of predicted probabilities $P(y|x)$ that sum to 1, where each probability corresponds to the model’s confidence that the image $x$ belongs to some class $y$ (i.e. that $y$ is a good label for image $x$ ). For example, given an image of a Golden Retriever, the model may output the predicted label $y_{pred} = dog$ with predicted probabilities $\begin{bmatrix} 0.8 & 0.1 & 0.1 \end{bmatrix}$ , where the probabilities correspond to the model’s confidence that the image is a dog, cat, or bear, respectively. In this example, the model is 8 times more confident that the image is of a dog than a cat or a bear.

A naive approach

One of the simplest approaches to find label errors in a dataset is to compare the predicted label against the given label. If they differ then we say that there is a label error. While this may work for some examples, there are two major shortcomings with this approach.

First, it does not take into account the model’s confidence. $y_{pred}$ is simply the label with the highest predicted probability, but the model may not be very confident in its prediction. Consider an image $x$ labeled as cat where the model outputs predicted probabilities $\begin{bmatrix} 0.34 & 0.33 & 0.33 \end{bmatrix}$ . Even though $y_{pred} = dog$ , the model is not very confident in its prediction, and so we may not want to say there’s a label error.

The second major shortcoming with this approach is that a model may be much more confident in some classes than others, but this approach doesn’t take that into account. For example, let’s say the model predicts a probability of 0.90 for class dog on average across all dog-labeled images, but predicts 0.35 for class cat across all cat-labeled images. Now say you have an image with probabilities $\begin{bmatrix} 0.50 & 0.49 & 0.01 \end{bmatrix}$ for dog, cat, and bear. Clearly, the model is relatively more confident in cat than dog if we take into account the average class confidence, but this naive approach would just assume the true label should be dog (even though 0.50 is much lower than 0.90, the average of other images labeled dog).
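Both shortcomings can be reproduced in a few lines of Python. This is only a sketch of the naive rule described above. The function names are made up for illustration, and the numbers are the ones used in the two examples.

```python
CLASSES = ["dog", "cat", "bear"]  # column order of each probability row


def predicted_label(probs):
    """Naive prediction: the class with the highest predicted probability."""
    return CLASSES[max(range(len(CLASSES)), key=lambda i: probs[i])]


def naive_label_issue(probs, given_label):
    """Naive rule: flag an error whenever the prediction differs from the given label."""
    return predicted_label(probs) != given_label


# Shortcoming 1: the image labeled cat is flagged even though the model
# barely prefers dog (0.34 vs 0.33).
near_uniform = [0.34, 0.33, 0.33]
print(predicted_label(near_uniform), naive_label_issue(near_uniform, "cat"))

# Shortcoming 2: per-class confidence levels are ignored.
# Average self-confidence per class, taken from the example in the text.
avg_confidence = {"dog": 0.90, "cat": 0.35}
probs = [0.50, 0.49, 0.01]

# Compare each probability to what is typical for that class.
exceeds_average = {
    c: probs[CLASSES.index(c)] >= avg_confidence[c] for c in avg_confidence
}
print(predicted_label(probs), exceeds_average)
```

The naive rule picks dog for the second image, yet 0.50 is well below what the model typically assigns to dog-labeled images, while 0.49 is above what it typically assigns to cat-labeled images.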

The Cleanlab approach

Instead of relying only on the predicted label, it would be better if we could use the predicted probabilities to assess whether the model is especially confident in a predicted label. This motivates the introduction of class thresholds, where for each class, we compute a threshold based on the mean of the class probability for all images that have that class as the given label. These class thresholds will address both shortcomings mentioned above.

If an input $x$ has a predicted probability for dog that exceeds the corresponding class threshold, we can say that the model is especially confident in the prediction $y_{pred} = dog$ , as compared to other images that were labeled dog. If $x$ is labeled cat, we are then able to say with more assurance that there is a label error.

Though simple, class thresholds provide a powerful way for us to account for the model’s confidence when identifying label errors. When the threshold is set to the mean self-confidence, i.e. what we described above, it is theoretically proven to exactly find label errors under certain conditions (allowing for error in every predicted probability out of a model for every example and every class).
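A minimal sketch of the mean self-confidence threshold, assuming predicted probabilities are stored as rows ordered (dog, cat, bear). The function name and the toy data are hypothetical and are not part of the cleanlab package's API.

```python
from statistics import mean

CLASSES = ["dog", "cat", "bear"]  # column order of each probability row


def mean_self_confidence_thresholds(pred_probs, given_labels):
    """Threshold for class k = mean of P(k) over images whose given label is k."""
    thresholds = {}
    for k, name in enumerate(CLASSES):
        self_conf = [p[k] for p, y in zip(pred_probs, given_labels) if y == name]
        if self_conf:  # skip classes with no labeled examples
            thresholds[name] = mean(self_conf)
    return thresholds


# Toy predicted probabilities (hypothetical values).
pred_probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.40, 0.50, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.05, 0.05, 0.90],
    [0.10, 0.20, 0.70],
]
given_labels = ["dog", "dog", "dog", "cat", "cat", "bear", "bear"]

thresholds = mean_self_confidence_thresholds(pred_probs, given_labels)
print(thresholds)

# An image labeled cat whose P(dog) exceeds the dog threshold is a likely label error.
query = [0.75, 0.20, 0.05]
confident_dog = query[CLASSES.index("dog")] >= thresholds["dog"]
print(confident_dog)
```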

Constructing the confident joint

With the predicted probabilities for each image and the thresholds for each class, we can now categorize each image based on its given label and suggested label. We introduce this new term, suggested label, to signify that this is the predicted label with a probability that exceeds the class threshold, so Cleanlab is actively suggesting that the predicted label is correct.

We can construct a 3 x 3 matrix to represent these two dimensions (given label, suggested label). This is the confident joint. Notice that any image that is placed in the diagonal is an instance where the given label and the suggested label match, i.e. no label issue has been identified for this example. In contrast, any image placed in the off-diagonal is an instance where Cleanlab has suggested a label that differs from the given label, i.e. Cleanlab thinks that the image is labeled incorrectly.
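The sketch below counts images into a 3 x 3 confident joint. It assumes that when more than one class exceeds its threshold, the suggested label is the most probable of those classes. That tie-breaking rule, the function names, and the example thresholds are assumptions made for illustration.

```python
CLASSES = ["dog", "cat", "bear"]  # column order of each probability row


def suggested_label(probs, thresholds):
    """Most probable class among those meeting their threshold, or None."""
    confident = [k for k, c in enumerate(CLASSES) if probs[k] >= thresholds[c]]
    if not confident:
        return None
    return CLASSES[max(confident, key=lambda k: probs[k])]


def confident_joint(pred_probs, given_labels, thresholds):
    """joint[i][j] = number of images with given label i and suggested label j."""
    n = len(CLASSES)
    joint = [[0] * n for _ in range(n)]
    for probs, given in zip(pred_probs, given_labels):
        suggestion = suggested_label(probs, thresholds)
        if suggestion is not None:  # images without a suggestion are left out
            joint[CLASSES.index(given)][CLASSES.index(suggestion)] += 1
    return joint


thresholds = {"dog": 0.7, "cat": 0.4, "bear": 0.8}  # hypothetical class thresholds
pred_probs = [
    [0.90, 0.05, 0.05],  # given dog  -> suggested dog  (diagonal)
    [0.80, 0.15, 0.05],  # given cat  -> suggested dog  (off-diagonal)
    [0.30, 0.60, 0.10],  # given cat  -> suggested cat  (diagonal)
    [0.05, 0.05, 0.90],  # given bear -> suggested bear (diagonal)
    [0.50, 0.30, 0.20],  # given dog  -> no class meets its threshold
]
given_labels = ["dog", "cat", "cat", "bear", "dog"]

joint = confident_joint(pred_probs, given_labels, thresholds)
for name, row in zip(CLASSES, joint):
    print(name, row)

# Off-diagonal entries are the suspected label errors.
num_issues = sum(joint[i][j] for i in range(3) for j in range(3) if i != j)
print(num_issues)
```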

Last bit of theory: Out of distribution or Unconfident

What happens to images that don’t make it into the confident joint?

Recall that for there to be a confidently suggested label, at least one of the three predicted probabilities must be greater than its corresponding class threshold. It is possible for none of the probabilities to exceed their class thresholds, especially for abnormal examples that may not even belong to any of our classes. Although the model is not confident about examples left out of the confident joint, their labels may still be worth reviewing closely. The cleanlab package provides additional algorithms that can automatically identify examples that are outliers (out of distribution).

One possible treatment is to consider all these examples to be out of distribution, meaning they may not belong to any of the classes in our classification task at all. For example, if there is an image of an airplane in our dataset, our model may output near-uniform probabilities that fall below every class threshold, e.g. $\begin{bmatrix} 0.34 & 0.33 & 0.33 \end{bmatrix}$ , so it will not make it into the confident joint. In this case, we can say that our model has correctly identified that this image is out of distribution.

In practice, however, a significant percentage of in-distribution examples do not make it into the confident joint. This may be due to the image being an atypical example of a class, say, a cat in a Halloween costume. For such an example, the predicted label would still be cat, but the predicted probability would not be high enough to exceed the class percentile threshold. A better description would be to say that Cleanlab is not confident that the predicted label is correct, and does not have sufficient grounds to make a determination.

We want to distinguish between images where Cleanlab is unconfident and images that are out of distribution. This motivates the introduction of an out-of-distribution threshold, which we set to a low percentile (for example, the 10th) of the predicted probabilities for each class. The lower this threshold is, the more likely it is that images below the threshold are poorly described by any of the classes specified for the classification task. Like the class threshold, this threshold is exposed as a slider in the visualization and can be controlled by the user.

Any images that are not out of distribution and not in the confident joint are simply unconfident examples.
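The three outcomes can be summarized in one small function. This is a sketch under two assumptions: an image is out of distribution only if its probability is below the out-of-distribution threshold for every class, and the confident check runs first. The threshold values and names are hypothetical.

```python
CLASSES = ["dog", "cat", "bear"]  # column order of each probability row


def categorize(probs, class_thresholds, ood_thresholds):
    """Return (category, suggested_label) for one image's predicted probabilities."""
    confident = [k for k, c in enumerate(CLASSES) if probs[k] >= class_thresholds[c]]
    if confident:
        # In the confident joint: suggest the most probable confident class.
        return ("confident", CLASSES[max(confident, key=lambda k: probs[k])])
    if all(probs[k] < ood_thresholds[c] for k, c in enumerate(CLASSES)):
        # Below the low threshold for every class: likely none of the classes.
        return ("out_of_distribution", None)
    # Neither confidently labeled nor clearly out of distribution.
    return ("unconfident", None)


class_thresholds = {"dog": 0.70, "cat": 0.60, "bear": 0.80}  # e.g. a high percentile
ood_thresholds = {"dog": 0.40, "cat": 0.35, "bear": 0.45}    # e.g. a low percentile

print(categorize([0.85, 0.10, 0.05], class_thresholds, ood_thresholds))  # typical dog
print(categorize([0.34, 0.33, 0.33], class_thresholds, ood_thresholds))  # airplane-like
print(categorize([0.30, 0.55, 0.15], class_thresholds, ood_thresholds))  # atypical cat
```

With these toy thresholds, the airplane-like probabilities land in the out-of-distribution bucket, while the atypical cat is simply unconfident.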

Implementation

We built our visualization using React, TypeScript, and Chakra UI. (Chakra UI has particularly good support for dark mode!)

Taking a leaf out of playground.tensorflow.org, we deliberately constrained the visualization to fit within the browser window so that scrolling is not needed.

We also wrote the application to run entirely in the browser, including model training and evaluation. This avoids any latency from having the frontend communicate with a backend.

Dataset

For visual appeal and interpretability, we showcase the algorithm working on an image dataset.

To construct the dataset, we manually selected 300 images from ImageNet, sampling mainly from images of dogs, cats, and bears. We include 97 examples of each of these classes. For the remaining 9 images, we randomly selected pictures of non-animals. These serve as out-of-distribution images.

The images of dogs, cats, and bears mostly have the correct given label, except for 3 examples in each class, which are deliberately given an incorrect label. The out-of-distribution images are randomly assigned to one of the animal classes. Overall, this means that our dataset of 300 images has 18 errors.
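As a quick check of these counts, treating each mislabeled animal image and each out-of-distribution image as one error:

$$3 \times 97 + 9 = 300 \ \text{images}, \qquad 3 \times 3 + 9 = 18 \ \text{errors}, \qquad \frac{18}{300} = 6\%.$$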

Generating predicted probabilities

To generate predicted probabilities, we have to train an image classifier on the dataset.

One option is to train a neural network in the browser from scratch, and JavaScript libraries like ML5 and ConvNetJS are good choices for this. However, even for a small dataset, this can take up to a few minutes depending on the complexity of the neural network.

Instead, to optimize for speed, we pre-computed image embeddings using a pre-trained ResNet-18 model. For each image, using the penultimate layer in the model, we obtain a 512-dimensional image embedding that serves as a high quality compressed representation of the image. During training time, the app uses these embeddings to train an SVM classifier (LIBSVM-JS) to quickly obtain the predicted probabilities.

Given the small size of our dataset ( $n = 300$ ), we reduce the dimensionality of each vector to 32 using truncated singular value decomposition, as the dimensionality of the inputs should not exceed the number of examples the classifier learns from. This also speeds up training of the SVM classifier to a fraction of a second.
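To show how out-of-sample predicted probabilities come out of cross-validation, here is a small Python sketch. It deliberately swaps the SVM for a nearest-centroid stand-in classifier so that it runs with the standard library alone, and it uses made-up 2-dimensional "embeddings" instead of the reduced ResNet-18 vectors. All function names are hypothetical.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y, num_classes):
    """Stand-in 'training' step: the mean embedding of each class."""
    dim = len(X[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, label in zip(X, y):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += x[d]
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    scores = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def out_of_sample_pred_probs(X, y, num_classes, k=5, seed=0):
    """Score every datapoint with a model that never saw it during training."""
    pred_probs = [None] * len(X)
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [label for i, label in enumerate(y) if i not in held_out]
        centroids = fit_centroids(train_X, train_y, num_classes)
        for i in fold:
            pred_probs[i] = predict_proba(centroids, X[i])
    return pred_probs


# Toy embeddings: three well-separated clusters, 10 points each (hypothetical data).
rng = random.Random(1)
X, y = [], []
for label, (cx, cy) in enumerate([(0, 0), (5, 0), (0, 5)]):
    for _ in range(10):
        X.append([cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)])
        y.append(label)

pred_probs = out_of_sample_pred_probs(X, y, num_classes=3, k=5)
print(pred_probs[0])
```

The important property is that each row of `pred_probs` comes from a model trained on the other four folds, so a mislabeled image cannot inflate its own predicted probability.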

As a sanity check, we used t-SNE to represent these datapoints on a 2-dimensional plot to verify that the image embeddings were being generated correctly.

The data is clustered fairly well into the three classes. We can see that the clusters for cats and dogs are overlapping somewhat, while the cluster for bear is relatively distinct, which is what we would expect intuitively given the relative sizes of these animals. The datapoints that are out-of-distribution mostly lie further away from the centroids of the three clusters. All is well!

Computing class thresholds

With the predicted probabilities, computing percentile thresholds for each class is straightforward. By default, we set the class thresholds at the 50th percentile (aka the median), and the out-of-distribution thresholds to the 15th percentile. However, sliders in the app allow users to set each of these anywhere from the 0th to the 100th percentile.

Note that:

A higher class threshold would mean that any errors identified are found with higher confidence.
A lower out-of-distribution threshold means that any out-of-distribution examples identified are found with higher confidence.

Fun fact: According to the original paper, how different percentile thresholds perform in identifying label errors is still unexplored future work!

Putting it all together

Based on the predicted probabilities and class / out-of-distribution threshold, we can categorize every datapoint based on its given label and Cleanlab’s suggested label.

In this case, since we have three classes, we can construct a 3 x 3 matrix to represent these two dimensions (given label, suggested label). This is the confident joint matrix.

Datapoints that do not make it into the confident joint (i.e. Cleanlab was not confident enough to suggest a label) are categorized as either out of distribution or examples Cleanlab is not confident about. (Any image that is not in the confident joint or the set of out-of-distribution images is an image Cleanlab is not confident about.)

If you have made it this far, you now have a better understanding of how Cleanlab’s algorithm for automated label detection and correction works and how we built our visualization for it. The code for our visualization is open source. You may also want to check out the Cleanlab repository for our open-source Python package, or join our Slack community if you are broadly interested in data-centric AI.

While cleanlab helps you automatically find data issues, an interface is needed to efficiently fix these issues in your dataset. Cleanlab Studio finds and fixes errors automatically in a (very cool) no-code platform. Export your corrected dataset in a single click to train better ML models on better data. Try Cleanlab Studio at https://studio.cleanlab.ai/.
