Handling Label Errors in Text Classification Datasets

  • Wei Jing Lok
  • Jonas Mueller


In machine learning and natural language processing, we train models to predict given labels, assuming that these labels are actually correct. However, recent studies have found that even highly-curated ML benchmark datasets are full of label errors, and real-world datasets can be of far lower quality. In light of these problems, the recent shift toward data-centric AI encourages data scientists to spend at least as much time improving their data as they do improving their models. No matter how much you tweak them, the quality of your models will ultimately depend on the quality of the data used to train and evaluate them.

The open-source cleanlab library provides a standard framework for implementing data-centric AI. cleanlab helps you quickly identify problems in messy real-world data, enabling more reliable machine learning and analytics. In this hands-on blog, we’ll use cleanlab to find label issues in the IMDb movie review text classification dataset. Commonly used to train/evaluate sentiment analysis models, this dataset contains 50,000 text reviews of films, each labeled with a binary {0, 1} sentiment polarity value indicating whether the review is overall positive (1) or negative (0).

Here’s a review that cleanlab found in the IMDB data, which has been incorrectly labeled as positive:

Like the gentle giants that make up the latter half of this film’s title, Michael Oblowitz’s latest production has grace, but it’s also slow and ponderous. The producer’s last outing, “Mosquitoman-3D” had the same problem. It’s hard to imagine a boring shark movie, but they somehow managed it. The only draw for Hammerhead: Shark Frenzy was it’s passable animatronix, which is always fun when dealing with wondrous worlds beneath the ocean’s surface. But even that was only passable. Poor focus in some scenes made the production seems amateurish. With Dolphins and Whales, the technology is all but wasted. Cloudy scenes and too many close-ups of the film’s giant subjects do nothing to take advantage of IMAX’s stunning 3D capabilities. There are far too few scenes of any depth or variety. Close-ups of these awesome creatures just look flat and there is often only one creature in the cameras field, so there is no contrast of depth. Michael Oblowitz is trying to follow in his father’s footsteps, but when you’ve got Shark-Week on cable, his introspective and dull treatment of his subjects is a constant disappointment.

The rest of this post demonstrates how to run cleanlab to find many more issues like this in the IMDB dataset. We also demonstrate how cleanlab can automatically improve your data to give you better ML performance without you having to change your model at all. You can easily use the same cleanlab workflow demonstrated here to find issues in your own dataset, and you can run it yourself in under 5 minutes.

Overview of steps to find label issues and improve models

This blog will walk through the following high-level workflow for finding label issues in text classification data with cleanlab:

  1. Construct a TensorFlow/Keras neural net and make it scikit-learn compatible via SciKeras.

  2. Use this classifier to compute out-of-sample predicted probabilities via cross-validation.

  3. Use these predictions to find bad labels via cleanlab’s find_label_issues method.

  4. Train a more robust version of the same neural net via cleanlab’s CleanLearning wrapper.

The remainder of this blog lists the step-by-step code needed to implement this workflow.

Show me the code

We'll start by installing and importing some required packages.

You can use pip to install the dependencies for this workflow:

pip install cleanlab scikit-learn pandas tensorflow tensorflow_datasets scikeras

Our first few Python commands will import some of the required packages and set some seeds for reproducibility.

import os
import random
import numpy as np

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Controls amount of tensorflow output

SEED = 123456  # Just for reproducibility
random.seed(SEED)
np.random.seed(SEED)


Prepare the dataset

The IMDb text dataset is readily available via TensorFlow Datasets.

import tensorflow_datasets as tfds

raw_full_ds = tfds.load(
    name="imdb_reviews", split=("train+test"), batch_size=-1, as_supervised=True
)
raw_full_texts, full_labels = tfds.as_numpy(raw_full_ds)

num_classes = len(set(full_labels))  # 2 for this positive/negative binary classification task
print(f"Classes: {set(full_labels)}")
Classes: {0, 1}

Let’s look at the first example in the dataset.

i = 0
print(f"Example Label: {full_labels[i]}")
print(f"Example Text: {raw_full_texts[i]}")
Example Label: 0
Example Text: "This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

Reassuringly, at least this example seems properly labeled (recall 0 corresponds to a review labeled as negative).

The data are stored as two numpy arrays:

  1. raw_full_texts contains the movie reviews in raw text format.
  2. full_labels contains the labels.

To run this workflow on your own dataset, you can simply replace raw_full_texts and full_labels above, and continue with the rest of the steps. Your classes (and entries of full_labels) should be represented as integer indices 0, 1, ..., num_classes - 1.
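If your own labels are stored as strings rather than integer indices, a small sketch of the required conversion (the variable names my_texts and my_string_labels here are hypothetical, not part of this workflow's API):

```python
import numpy as np

# Hypothetical example data with string class names:
my_texts = ["great film", "terrible plot", "loved it"]
my_string_labels = ["pos", "neg", "pos"]

# Map class names to integer indices 0, ..., num_classes - 1:
classes = sorted(set(my_string_labels))                      # ['neg', 'pos']
label_map = {name: idx for idx, name in enumerate(classes)}  # {'neg': 0, 'pos': 1}

raw_full_texts = np.asarray(my_texts)
full_labels = np.asarray([label_map[l] for l in my_string_labels])
print(full_labels)  # [1 0 1]
```

With the two arrays in this form, every subsequent step of the workflow applies unchanged.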

We'll next convert the text strings into index vectors, full_texts, which are better suited as inputs for neural network models.

Here we first define a function to preprocess the text data by:

  1. Converting it to lower case.
  2. Removing HTML break tags.
  3. Removing any punctuation marks.
import tensorflow as tf
import re
import string

def preprocess_text(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"[{re.escape(string.punctuation)}]", "")

We use a TextVectorization layer to preprocess, tokenize, and vectorize our text data, thus making it suitable as input for a neural network.

from tensorflow.keras import layers


max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=preprocess_text,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)
Adapting the vectorize_layer to our text data creates a mapping of each token (i.e. word) to a corresponding integer index. Subsequently, we can vectorize our text data via this mapping, and store it as a numpy array.

vectorize_layer.adapt(raw_full_texts)
full_texts = vectorize_layer(raw_full_texts)
full_texts = full_texts.numpy()

Our subsequent neural network models will directly operate on elements of full_texts in order to classify reviews.

Define a text classification model

Here, we build a simple neural network for text classification via the TensorFlow and Keras deep learning frameworks.

from tensorflow.keras import losses, metrics

def get_net():
    net = tf.keras.Sequential(
        [tf.keras.Input(shape=(None,), dtype="int64"),
         layers.Embedding(max_features + 1, 16),
         layers.Dropout(0.2),
         layers.GlobalAveragePooling1D(),
         layers.Dropout(0.2),
         layers.Dense(num_classes),
         layers.Softmax()])
    net.compile(optimizer="adam", loss=losses.SparseCategoricalCrossentropy(),
                metrics=[metrics.SparseCategoricalAccuracy()])
    return net

This network is similar to the fastText model, which is surprisingly effective for many text classification problems despite its simplicity. The inputs to this network will be the elements of full_texts, and its outputs will correspond to the probability that the given movie review should be labeled as class 0 or 1 (i.e. whether it is overall negative or positive).

As some cleanlab features require scikit-learn compatibility, we will adapt the above network accordingly. SciKeras is a convenient package that makes this really easy.

from scikeras.wrappers import KerasClassifier

model = KerasClassifier(get_net(), epochs=10)  # you can increase the number of training epochs to get better results

Once adapted in this manner, the neural net can be used with all your favorite scikit-learn model methods, such as fit() and predict(). Now you can train or apply the network with just a single line of code!

Compute out-of-sample predicted probabilities

To identify label issues in the data, cleanlab uses probabilistic predictions from a trained classification model. These predictions should specifically be the classifier’s estimate of the conditional probability of each class for a specific example (c.f. sklearn.predict_proba). This blog post on Confident Learning describes the algorithm cleanlab uses to find label issues.

We’ll often want cleanlab to find label issues in all of our data. However, after training a classifier on some of this data, its predictions on the same data may suffer from overfitting. To circumvent this issue, we can instead use K-fold cross-validation to fit our classifier, which enables us to get out-of-sample predicted probabilities for every example in the dataset. These are predictions from a copy of the classifier trained on a dataset that did not contain this example (and thus will not be overfit to this example). The cross_val_predict method used below enables you to easily generate out-of-sample predicted probabilities from any scikit-learn-compatible model.
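To illustrate the mechanics with something lightweight, here is the same cross_val_predict pattern applied to a scikit-learn LogisticRegression on synthetic data, standing in for our neural network (the data here are made up purely for demonstration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data:
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# Each row of pred_probs comes from a model that never saw that example during training:
pred_probs = cross_val_predict(LogisticRegression(), X, y, cv=3, method="predict_proba")
print(pred_probs.shape)  # (100, 2): one out-of-sample probability per class, per example
```

The exact same call works with our SciKeras-wrapped neural network, as shown next.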

from sklearn.model_selection import cross_val_predict

num_crossval_folds = 3  # chosen for efficiency here, values like 5 or 10 will generally work better
pred_probs = cross_val_predict(
    model, full_texts, full_labels, cv=num_crossval_folds, method="predict_proba"
)

An additional benefit of cross-validation is that it facilitates more reliable evaluation of our model than a single training/validation split would.

from sklearn.metrics import log_loss

loss = log_loss(full_labels, pred_probs)  # score to evaluate probabilistic predictions, lower values are better
print(f"Cross-validated estimate of log loss: {loss:.3f}")
Cross-validated estimate of log loss: 0.289

Models with more accurate/calibrated predictions tend to find label errors better when used with cleanlab. Thus we should always try to ensure that our model is reasonably performant. For instance, you may consider Transformer neural networks instead of the very simple network we introduced above. cleanlab works with any model!

Use cleanlab to find potential label errors

Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the probability that the model assigns to it.
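To make the self-confidence score concrete, here is a sketch in plain numpy of what it computes (toy numbers, not our IMDb predictions):

```python
import numpy as np

# Toy predicted probabilities for 3 examples in a binary task:
pred_probs = np.array([[0.9, 0.1],   # model confident in class 0
                       [0.2, 0.8],   # model confident in class 1
                       [0.6, 0.4]])  # model uncertain
labels = np.array([0, 0, 1])         # given labels (second and third look suspect)

# Self-confidence = the probability the model assigns to each given label:
self_confidence = pred_probs[np.arange(len(labels)), labels]
print(self_confidence)               # [0.9 0.2 0.4]

# Sorting by ascending self-confidence puts the most suspicious examples first:
ranking = np.argsort(self_confidence)
print(ranking)                       # [1 2 0]
```

cleanlab's actual algorithm does more than this simple ranking, but this conveys what "self_confidence" measures.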

from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=full_labels, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)

Let’s review some of the examples cleanlab thinks are most likely to be incorrectly labeled:

print(
    f"cleanlab found {len(ranked_label_issues)} potential label errors. Here are indices of the top 10 most likely errors: \n {ranked_label_issues[:10]}"
)
cleanlab found 2588 potential label errors.

Here are indices of the top 10 most likely errors:
 [10404 44582 30151 43777 16633 13853 21165 21348 22370 13912]
To inspect these examples, we define a method, print_as_df, to print any example from the dataset.
import pandas as pd

pd.set_option("display.max_colwidth", None)

def print_as_df(index):
    return pd.DataFrame({"texts": raw_full_texts[index], "labels": full_labels[index]}, [index])

Let’s now inspect some of the top-ranked label issues identified by cleanlab. Below we highlight 3 reviews that are each labeled as positive (1), but should instead be labeled as negative (0).

This movie is stuffed full of stock Horror movie goodies: chained lunatics, pre-meditated murder, a mad (vaguely lesbian) female scientist with an even madder father who wears a mask because of his horrible disfigurement, poisoning, spooky castles, werewolves (male and female), adultery, slain lovers, Tibetan mystics, the half-man/half-plant victim of some unnamed experiment, grave robbing, mind control, walled up bodies, a car crash on a lonely road, electrocution, knights in armour - the lot, all topped off with an incredibly awful score and some of the worst Foley work ever done. The script is incomprehensible (even by badly dubbed Spanish Horror movie standards) and some of the editing is just bizarre. In one scene where the lead female evil scientist goes to visit our heroine in her bedroom for one of the badly dubbed: "That is fantastical. I do not understand. Explain to me again how this is..." exposition scenes that litter this movie, there is a sudden hand held cutaway of the girl's thighs as she gets out of bed for no apparent reason at all other than to cover a cut in the bad scientist's "Mwahaha! All your werewolfs belong mine!" speech. Though why they went to the bother I don't know because there are plenty of other jarring jump cuts all over the place - even allowing for the atrocious pan and scan of the print I saw. The Director was, according to one interview with the star, drunk for most of the shoot and the film looks like it. It is an incoherent mess. It's made even more incoherent by the inclusion of werewolf rampage footage from a different film The Mark of the Wolf Man (made 4 years earlier, featuring the same actor but playing the part with more aggression and with a different shirt and make up - IS there a word in Spanish for "Continuity"?) and more padding of another actor in the wolfman get-up ambling about in long shot. 
The music is incredibly bad varying almost at random from full orchestral creepy house music, to bosannova, to the longest piano and gong duet ever recorded. (Thinking about it, it might not have been a duet. It might have been a solo. The piano part was so simple it could have been picked out with one hand while the player whacked away at the gong with the other.) This is one of the most bewilderedly trance-state inducing bad movies of the year so far for me. Enjoy. Favourite line: "Ilona! This madness and perversity will turn against you!" How true. Favourite shot: The lover, discovering his girlfriend slain, dropping the candle in a cartoon-like demonstration of surprise. Rank amateur directing there.

Noteworthy snippets extracted from the first review:

  • “…incredibly awful score…”

  • “…worst Foley work ever done.”

  • “…script is incomprehensible…”

  • “…editing is just bizarre.”

  • “…atrocious pan and scan…”

  • “…incoherent mess…”

  • “…amateur directing there.”

This low-budget erotic thriller that has some good points, but a lot more bad one. The plot revolves around a female lawyer trying to clear her lover who is accused of murdering his wife. Being a soft-core film, that entails her going undercover at a strip club and having sex with possible suspects. As plots go for this type of genre, not to bad. The script is okay, and the story makes enough sense for someone up at 2 AM watching this not to notice too many plot holes. But everything else in the film seems cheap. The lead actors aren't that bad, but pretty much all the supporting ones are unbelievably bad (one girl seems like she is drunk and/or high). The cinematography is badly lit, with everything looking grainy and ugly. The sound is so terrible that you can barely hear what people are saying. The worst thing in this movie is the reason you're watching it-the sex. The reason people watch these things is for hot sex scenes featuring really hot girls in Red Shoe Diary situations. The sex scenes aren't hot they're sleazy, shot in that porno style where everything is just a master shot of two people going at it. The woman also look like they are refuges from a porn shoot. I'm not trying to be rude or mean here, but they all have that breast implants and a burned out/weathered look. Even the title, "Deviant Obsession", sounds like a Hardcore flick. Not that I don't have anything against porn - in fact I love it. But I want my soft-core and my hard-core separate. What ever happened to actresses like Shannon Tweed, Jacqueline Lovell, Shannon Whirry and Kim Dawson? Women that could act and who would totally arouse you? And what happened to B erotic thrillers like Body Chemistry, Nighteyes and even Stripped to Kill. Sure, none of these where masterpieces, but at least they felt like movies. Plus, they were pushing the envelope, going beyond Hollywood's relatively prude stance on sex, sexual obsessions and perversions. 
Now they just make hard-core films without the hard-core sex.

Noteworthy snippets extracted from the second review:

  • “…film seems cheap.”

  • “…unbelievably bad…”

  • “…cinematography is badly lit…”

  • “…everything looking grainy and ugly.”

  • “…sound is so terrible…”

Like the gentle giants that make up the latter half of this film's title, Michael Oblowitz's latest production has grace, but it's also slow and ponderous. The producer's last outing, "Mosquitoman-3D" had the same problem. It's hard to imagine a boring shark movie, but they somehow managed it. The only draw for Hammerhead: Shark Frenzy was it's passable animatronix, which is always fun when dealing with wondrous worlds beneath the ocean's surface. But even that was only passable. Poor focus in some scenes made the production seems amateurish. With Dolphins and Whales, the technology is all but wasted. Cloudy scenes and too many close-ups of the film's giant subjects do nothing to take advantage of IMAX's stunning 3D capabilities. There are far too few scenes of any depth or variety. Close-ups of these awesome creatures just look flat and there is often only one creature in the cameras field, so there is no contrast of depth. Michael Oblowitz is trying to follow in his father's footsteps, but when you've got Shark-Week on cable, his introspective and dull treatment of his subjects is a constant disappointment.

Noteworthy snippets extracted from the third review:

  • “…hard to imagine a boring shark movie…”

  • “Poor focus in some scenes made the production seems amateurish.”

  • “…do nothing to take advantage of…”

  • “…far too few scenes of any depth or variety.”

  • “…just look flat…no contrast of depth…”

  • “…introspective and dull…constant disappointment.”

With find_label_issues, cleanlab has shortlisted the most likely label errors to speed up your data cleaning process. You should carefully inspect as many of these examples as you can for potential problems.

Train a more robust model from noisy labels

Manually inspecting and fixing the identified label issues may be time-consuming. Fortunately, cleanlab can automatically filter these noisy examples out of the dataset and train a model on the remaining clean data for you.
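The filtering itself boils down to dropping the flagged indices before the final training run; here is a minimal numpy sketch with hypothetical flagged indices (not our actual IMDb results):

```python
import numpy as np

# Hypothetical labels and flagged indices, just to show the filtering step:
labels = np.array([1, 0, 1, 0, 1])
ranked_label_issues = np.array([2, 4])  # indices cleanlab flagged as likely errors

# Keep only the examples that were not flagged:
keep = np.setdiff1d(np.arange(len(labels)), ranked_label_issues)
clean_labels = labels[keep]
print(keep)          # [0 1 3]
print(clean_labels)  # [1 0 0]
```

cleanlab's CleanLearning wrapper, demonstrated below, handles this bookkeeping (and the retraining) for you.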

To demonstrate this, we re-process the dataset into separate train/test splits: train_texts and test_texts.
raw_train_ds = tfds.load(name="imdb_reviews", split="train", batch_size=-1, as_supervised=True)
raw_test_ds = tfds.load(name="imdb_reviews", split="test", batch_size=-1, as_supervised=True)

raw_train_texts, train_labels = tfds.as_numpy(raw_train_ds)
raw_test_texts, test_labels = tfds.as_numpy(raw_test_ds)

We featurize the raw text using the same vectorize_layer as before, but we first reset its state and adapt it only on the train set (as is proper ML practice). We finally convert the vectorized text data in the train/test sets into numpy arrays.


vectorize_layer.reset_state()
vectorize_layer.adapt(raw_train_texts)

train_texts = vectorize_layer(raw_train_texts)
test_texts = vectorize_layer(raw_test_texts)

train_texts = train_texts.numpy()
test_texts = test_texts.numpy()

Let’s now train and evaluate our original neural network model.

from sklearn.metrics import accuracy_score

model = KerasClassifier(get_net(), epochs=10), train_labels)

preds = model.predict(test_texts)
acc_og = accuracy_score(test_labels, preds)
print(f"\n Test accuracy of original neural net: {acc_og}")
Test accuracy of original neural net: 0.8738

cleanlab provides a wrapper class that can easily be applied to any scikit-learn compatible model. Once wrapped, the resulting model can still be used in the exact same manner, but it will now train more robustly if the data have noisy labels.

from cleanlab.classification import CleanLearning

model = KerasClassifier(get_net(), epochs=10)  # Note we first re-instantiate the model
cl = CleanLearning(clf=model, seed=SEED)  # cl has same methods/attributes as model

When we train the cleanlab-wrapped model, the following operations take place: The original model is trained in a cross-validated fashion to produce out-of-sample predicted probabilities. Then these predicted probabilities are used to identify label issues, and the corresponding examples identified to have issues are removed from the dataset. Finally, the original model is trained once more on the remaining clean subset of the data.
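These operations can be sketched in plain Python (a schematic, not cleanlab's actual internals; get_probs_via_cv, find_issues, and retrain are hypothetical stand-ins for the components described above):

```python
# Schematic of CleanLearning's fit procedure (NOT cleanlab's real internals;
# the three helper functions are hypothetical stand-ins you would supply):
def clean_learning_fit(X, y, get_probs_via_cv, find_issues, retrain):
    pred_probs = get_probs_via_cv(X, y)          # 1. out-of-sample predicted probabilities
    issue_idx = set(find_issues(y, pred_probs))  # 2. indices of likely label errors
    keep = [i for i in range(len(y)) if i not in issue_idx]
    X_clean = [X[i] for i in keep]
    y_clean = [y[i] for i in keep]
    return retrain(X_clean, y_clean)             # 3. final model trained on clean subset
```

The real CleanLearning class wraps all of this behind the familiar fit() interface.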

_ =, train_labels)

Note that to avoid re-computing the label issues, you can specify the additional fit() argument: label_issues = ranked_label_issues. Providing pre-computed label issues can save significant time by skipping over the cross-validation step straight to the final training of the model on the clean data subset. We could’ve also identified label issues directly via the CleanLearning class, without having to manage the previously demonstrated cross-validation steps, simply by executing: cl.find_label_issues(train_texts, train_labels).

We can get predictions from the resulting cleanlab-trained model and evaluate them, just like we did for our original neural network. The cleanlab-trained model remains the exact same type of model as our original and can be deployed just as the original would be (should have similar memory/latency).

pred_labels = cl.predict(test_texts)
acc_cl = accuracy_score(test_labels, pred_labels)
print(f"Test accuracy of cleanlab's neural net: {acc_cl}")
Test accuracy of cleanlab's neural net: 0.8755

We can see that the test set accuracy slightly improved as a result of the data cleaning. Note that this will not always be the case, especially if we are evaluating on test data that are themselves noisy. The best practice is to run cleanlab to identify potential label issues and then manually review them, before blindly trusting any accuracy metrics. In particular, the most effort should be made to ensure high-quality test data, which is supposed to reflect the expected performance of our model during deployment.


Conclusion

With one line of code, cleanlab automatically shortlists the most likely label errors to speed up your data cleaning process. Subsequently, you can carefully inspect the examples in this shortlist for potential problems. You can see that even widely-studied datasets like IMDB-reviews contain problematic labels. Never blindly trust your data! You should always check it for potential issues, many of which can be automatically flagged with cleanlab.

A simple way to deal with such issues is to remove examples flagged as potentially problematic from our training data. This can be done automatically via cleanlab’s CleanLearning class. In some cases, this simple procedure can lead to improved ML performance without you having to change your model at all! In other cases, you’ll need to manually handle the issues cleanlab has identified. For example, it may be better to manually fix some examples rather than omitting them entirely from the data. Be particularly wary about label errors lurking in your test data, as test sets guide many important decisions made in a typical ML application. While this post studied movie-review sentiment classification with Tensorflow neural networks, cleanlab can be easily used for any dataset (image, text, tabular, etc.) and with any classification model.

cleanlab is undergoing active development, and we’re always interested in more open-source contributors!
