cleanlab 2.1 adds Multi-Annotator Analysis and Outlier Detection: toward a broad framework for Data-Centric AI

Note: This article reflects Cleanlab’s earlier work on data quality. Today, Cleanlab is the control layer for AI agents, providing real-time detection and automated remediation to ensure safe, accurate, and compliant outputs.

cleanlab 2.1 is a leap forward toward a standard open-source framework for Data-Centric AI that can be used by engineers and data scientists in diverse applications. cleanlab 2.1 extends cleanlab beyond classification with label errors to several new data-centric ML tasks, including:

- Multi-annotator analysis
- Out of distribution detection
- Label error detection for token classification
- CleanLearning for finding label issues and training robust ML models on datasets with label errors, which now works out-of-the-box with pandas, pytorch, tensorflow, keras, and many other data formats and models. Often in one line of code, CleanLearning enables dozens of data-centric AI workflows with almost any model and data format. An example using HuggingFace Transformers, Keras, and Tensorflow datasets is available here.

Advancing open-source Data-Centric AI:

Two newsworthy aspects of this release:

cleanlab 2.1 is the most effective Python package to analyze multi-annotator (crowdsourcing) data for annotator and label quality (paper forthcoming).
cleanlab has grown quickly over the last year. cleanlab is the first tool that detects data and label issues in most supervised learning datasets, including image, text, audio, and token classification. cleanlab 2.1 is also useful for other core data-centric tasks such as out of distribution detection, dataset curation, and robust learning with noisy labels.

Major new functionalities added in 2.1:

CROWDLAB algorithms for analyzing data labeled by multiple annotators to:
- Accurately infer the best consensus label for each example in your dataset
- Estimate the quality of each consensus label (how likely is it correct)
- Estimate the quality of each annotator (how trustworthy are their suggested labels)
Out of Distribution Detection based on either:
- Feature values/embeddings
- Predicted class probabilities
Label error detection for Token Classification
- Supports NLP tasks like entity recognition
CleanLearning can now:
- Run on non-array data types including: pandas DataFrame, pytorch/tensorflow Datasets
- Utilize any Keras model (supporting sequential and functional APIs)

Other developer-focused improvements:

Added an FAQ with advice for common questions
Added many additional tutorial and example notebooks at: docs.cleanlab.ai and github.com/cleanlab/examples
Reduced dependencies: e.g. scipy is no longer needed

Code Examples and New Workflows in cleanlab 2.1:

1. Detect out of distribution examples

Detect out of distribution examples in a dataset based on its numeric feature embeddings

```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
```
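For intuition about what a feature-based outlier score can look like, here is a minimal sketch in plain Python. It is a generic nearest-neighbor illustration, not cleanlab's implementation: the helper `knn_outlier_scores`, the toy `embeddings`, and the choice of `k=2` are assumptions made for this example, and in this sketch a higher score means a more atypical example.

```python
import math

def knn_outlier_scores(points, k=2):
    """Average Euclidean distance from each point to its k nearest other points.

    Hypothetical helper for illustration only; higher means more atypical.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy 2-D embeddings: four points in a tight cluster and one far away.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

scores = knn_outlier_scores(embeddings, k=2)
most_atypical = max(range(len(scores)), key=lambda i: scores[i])
print(most_atypical, scores)
```

The point far from the cluster receives the largest score, which is the kind of example an out of distribution check is meant to surface.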

Detect out of distribution examples in a dataset based on predicted class probabilities from a trained classifier

```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
```
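To see why predicted class probabilities carry a signal about atypical examples, the sketch below ranks examples by the normalized entropy of their predictions. This is a simple uncertainty measure chosen for illustration and is not necessarily how cleanlab computes its scores; the helper `normalized_entropy` and the toy `pred_probs` values are assumptions.

```python
import math

def normalized_entropy(probs):
    """Entropy of a probability vector, scaled to [0, 1] by log(number of classes)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

# Toy predicted class probabilities for three examples (hypothetical values).
pred_probs = [
    [0.98, 0.01, 0.01],      # confident prediction
    [0.60, 0.30, 0.10],      # somewhat uncertain
    [1 / 3, 1 / 3, 1 / 3],   # maximally uncertain
]

uncertainty = [normalized_entropy(p) for p in pred_probs]

# Indices sorted from most to least uncertain.
ranked = sorted(range(len(pred_probs)), key=lambda i: uncertainty[i], reverse=True)
print(ranked)
```

Examples on which the classifier spreads its probability mass across many classes rise to the top of the ranking and are natural candidates for review.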

2. Multi-annotator data

For data labeled by multiple annotators (stored as matrix multiannotator_labels whose rows correspond to examples, columns to each annotator’s chosen labels), cleanlab 2.1 can: find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities pred_probs from any trained classifier.

```python
from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)
```
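To make the input layout and the three outputs concrete, here is a deliberately simple baseline in plain Python: majority-vote consensus plus each annotator's agreement rate with that consensus. This is not CROWDLAB (it ignores the classifier's `pred_probs` entirely); the helpers `majority_vote` and `annotator_agreement`, the use of `None` for "annotator did not label this example", and the tie-breaking rule are all assumptions made for this sketch.

```python
from collections import Counter

def majority_vote(labels_per_example):
    """Consensus label per example; ties go to the smallest label (arbitrary choice)."""
    consensus = []
    for row in labels_per_example:
        votes = Counter(label for label in row if label is not None)
        top_count = max(votes.values())
        consensus.append(min(label for label, n in votes.items() if n == top_count))
    return consensus

def annotator_agreement(labels_per_example, consensus):
    """Fraction of each annotator's labels that match the consensus label."""
    num_annotators = len(labels_per_example[0])
    rates = []
    for a in range(num_annotators):
        pairs = [(row[a], c) for row, c in zip(labels_per_example, consensus)
                 if row[a] is not None]
        rates.append(sum(l == c for l, c in pairs) / len(pairs) if pairs else None)
    return rates

# Rows are examples, columns are annotators; None marks a missing annotation.
multiannotator_labels = [
    [0, 0, 1],
    [1, None, 1],
    [2, 2, 2],
    [0, 1, 1],
]

consensus = majority_vote(multiannotator_labels)
agreement = annotator_agreement(multiannotator_labels, consensus)
print(consensus, agreement)
```

A baseline like this treats every annotator as equally reliable, which is exactly the limitation that motivates also using a trained classifier's predictions when estimating consensus and annotator quality.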

3. Entity Recognition and Token Classification

cleanlab 2.1 can now find label issues in token classification (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition).

```python
from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues

issues = find_label_issues(per_token_labels, per_token_pred_probs)
display_issues(issues, tokens, pred_probs=per_token_pred_probs, given_labels=per_token_labels,
               class_names=optional_list_of_ordered_class_names)
```

Example inputs (for dataset with K=2 classes) might look like this:

```python
import numpy as np

tokens = [..., ["I", "love", "cleanlab"], ...]
per_token_labels = [..., [1, 0, 0], ...]
per_token_pred_probs = [..., np.array([[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]), ...]  # predictions from model
optional_list_of_ordered_class_names = ["not-person", "person"]
```

Running this code on the CoNLL-2003 named entity recognition dataset uncovers many label errors, such as the following sentence:

Little change from today’s weather expected.

where Little is wrongly labeled as a PERSON entity in CoNLL.
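For a rough sense of how a token like this can be surfaced, the sketch below scores each token by the probability the model assigns to its given label and flags tokens whose given label disagrees with the model's top prediction. It reuses the toy inputs from the example above as plain lists. This is a simplified heuristic for illustration, not necessarily the scoring cleanlab applies; the helpers `token_self_confidence` and `candidate_token_issues` are hypothetical names.

```python
def token_self_confidence(sentence_labels, sentence_pred_probs):
    """Probability the model assigns to each token's given label."""
    return [probs[label] for label, probs in zip(sentence_labels, sentence_pred_probs)]

def candidate_token_issues(per_token_labels, per_token_pred_probs):
    """(sentence_index, token_index) pairs where the given label is not the model's top class."""
    issues = []
    for i, (labels, pred_probs) in enumerate(zip(per_token_labels, per_token_pred_probs)):
        for j, (label, probs) in enumerate(zip(labels, pred_probs)):
            predicted = max(range(len(probs)), key=lambda k: probs[k])
            if predicted != label:
                issues.append((i, j))
    return issues

# One sentence, K=2 classes: 0 = "not-person", 1 = "person".
tokens = [["I", "love", "cleanlab"]]
per_token_labels = [[1, 0, 0]]
per_token_pred_probs = [[[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]]

scores = token_self_confidence(per_token_labels[0], per_token_pred_probs[0])
issues = candidate_token_issues(per_token_labels, per_token_pred_probs)
flagged_words = [tokens[i][j] for i, j in issues]
print(scores, issues, flagged_words)
```

Tokens with a low score for their given label are the ones worth inspecting first; whether the label or the model is wrong still has to be judged by a person.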

4. CleanLearning can now operate directly on non-array dataset formats like tensorflow/pytorch datasets and use Keras models

```python
import numpy as np
import tensorflow as tf
from cleanlab.classification import CleanLearning
from cleanlab.experimental.keras import KerasWrapperModel

dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)

def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")

model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data
```
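One way to picture "training on cleaner data" is a prune-then-retrain loop: flag examples whose given label looks inconsistent with out-of-sample predicted probabilities, drop them, and fit the model again on the rest. The sketch below shows only the flagging step, using per-class confidence thresholds in the spirit of confident learning. It is a simplified illustration in plain Python rather than cleanlab's actual implementation; the helper names and the toy `labels` and `pred_probs` are assumptions.

```python
def per_class_thresholds(labels, pred_probs, num_classes):
    """For each class k, the average predicted probability of k over examples labeled k."""
    thresholds = []
    for k in range(num_classes):
        values = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # A class with no labeled examples gets an unreachable threshold.
        thresholds.append(sum(values) / len(values) if values else float("inf"))
    return thresholds

def suspected_label_issues(labels, pred_probs):
    """Indices whose most confident above-threshold class differs from the given label."""
    num_classes = len(pred_probs[0])
    thresholds = per_class_thresholds(labels, pred_probs, num_classes)
    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if confident and max(confident, key=lambda k: p[k]) != y:
            flagged.append(i)
    return flagged

# Toy out-of-sample predictions; example 2 is labeled 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10], [0.80, 0.20], [0.05, 0.95],
    [0.10, 0.90], [0.30, 0.70], [0.20, 0.80],
]

flagged = suspected_label_issues(labels, pred_probs)
kept = [i for i in range(len(labels)) if i not in flagged]
print(flagged, kept)  # a model would then be refit on the `kept` examples only
```

The per-class thresholds keep a class that the model is systematically less confident about from being flagged wholesale, which a single global cutoff would tend to do.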

More details in the official release notes.

Beyond cleanlab v2.1

While cleanlab 2.1 finds data issues, an interface is needed to efficiently fix these issues in your dataset. Cleanlab Studio finds and fixes errors automatically in a (very cool) no-code platform. Export your corrected dataset in a single click to train better ML models on better data.

Try Cleanlab Studio at https://studio.cleanlab.ai/.

Learn more about Cleanlab

How Google, Tencent, and others use Cleanlab.
Step-by-step tutorials to find issues in your data and train robust ML models:
- Image | Text | Audio | Outliers | Dataset Curation | Multi-annotator Data
Ways to try out Cleanlab:
- Open-source: GitHub
- No-code, automatic platform (easy mode): Cleanlab Studio
Documentation | Blogs | Research Publications | Cleanlab History | Team

Join our community of scientists and engineers to help build the future of open-source Data-Centric AI: Cleanlab Slack Community

Contributors

A big thank you to the data-centric jedi who contributed code for cleanlab 2.1 (in no particular order): Aravind Putrevu, Jonas Mueller, Anish Athalye, Johnson Kuan, Wei Jing Lok, Caleb Chiam, Hui Wen Goh, Ulyana Tkachenko, Curtis Northcutt, Rushi Chaudhari, Elías Snorrason, Shuangchi He, Eric Wang, and Mattia Sangermano.

We thank the individuals who contributed bug reports or feature requests. If you’re interested in contributing to cleanlab, check out our contributing guide!
