Open Source
New Feature

cleanlab now supports all major ML tasks — including Regression, Object Detection, and Image Segmentation

09/14/2023
  • Chris MauckChris Mauck
  • Curtis NorthcuttCurtis Northcutt
  • Jonas MuellerJonas Mueller

For over half a decade, we’ve been increasing the coverage of ML tasks that cleanlab supports, in response to user requests. Today we are proud to share that cleanlab directly supports improving the reliability/accuracy of datasets and models for all major ML tasks. We just released v2.5 of the package, which adds support for regression— yes, cleanlab supports data with real-valued targets now, not just class labels! (nod to the dozens of users who have requested that feature over the years). With v2.5, cleanlab is now the only package to support label error detection for both object detection and image segmentation as well, enabling users to find issues in their datasets for every major ML task across most data types and modalities.

It’s been a long time coming, and we hope you have some fun with v2.5!

All of the ML tasks supported via cleanlab package.

All features of cleanlab work with any dataset and any model. Here’s a summary of what cleanlab can do:

ML Tasks Supported

What’s New in v2.5

Regression

Although it’s brand new, Cleanlab’s regression functionality has already been used to successfully improve data quality in Kaggle competitions. Specifically, a new regression module has been introduced for label error detection in regression datasets like the one below.

Learn how to apply it to your own regression data within 5 minutes via the Find Noisy Labels in Regression Datasets quickstart tutorial. Here’s all the code needed to detect erroneous values in a regression dataset and fit a more robust regression model with the corrupted data automatically filtered out:

from cleanlab.regression.learn import CleanLearning

cl = CleanLearning(model)
cl.fit(X, y)
label_issues = cl.get_label_issues()
preds = cl.predict(X_test)  # predictions from a version of your model trained on auto-cleaned data

Object Detection

The new object_detection module can automatically detect annotation errors in object detection datasets. This includes images with: incorrect class labels annotated for bounding boxes, objects entirely overlooked by annotators, as well as poorly-drawn bounding boxes which do not optimally localize an object.

Learn how to apply this to your own object detection data within 5 minutes via the Finding Label Errors in Object Detection Datasets quickstart tutorial. Here’s all the code needed to detect errors in an object detection dataset or score images by their labeling quality:

from cleanlab.object_detection.filter import find_label_issues
from cleanlab.object_detection.rank import get_label_quality_scores

# Estimate which images have label issues (boolean vector) based on predictions from trained object detector
has_label_issue = find_label_issues(labels, predictions)

# Estimate label quality scores for all images
label_quality_scores = get_label_quality_scores(labels, predictions)

Image Segmentation

Cleanlab now also supports image segmentation tasks in just a few lines of code via the new semantic_segmentation module. The package can automatically detect many forms of annotator error which is common in such datasets because they require laborious labeling of each individual pixel.

Learn how to apply this to your own object detection data within 5 minutes via the Find Label Errors in Semantic Segmentation Datasets quickstart tutorial. Here’s all the code needed to detect errors in a segmentation dataset or score images by their labeling quality:

from cleanlab.semantic_segmentation.filter import find_label_issues
from cleanlab.segmentation.rank import get_label_quality_scores

# Estimate which images have label issues (boolean vector) based on predicted probabilties from trained segmentation model
images_w_issues = find_label_issues(labels, pred_probs)

# Estimate label quality scores for all images/pixels
image_scores, pixel_scores = get_label_quality_scores(labels, pred_probs)

Proper support for Data-Centric AI in regression, image segmentation, and object detection tasks required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor. Check them out to learn the math behind these new methods and see comphrensive benchmarks showing how effective they are on real datasets. Or stay tuned for upcoming blogposts summarizing all of the research behind these new methods!

Other Major Improvements in v2.5

  • Datalab Enhancements: Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this package on your datasets. Through the integration of CleanVision, Datalab can now also detect low-quality images (see above) such as those which are: blurry, over/under-exposed, low information, oddly-sized, etc.
  • Imbalanced and Unlabeled Data: Datalab can now also flag rare classes in imbalanced classification datasets and even audit unlabeled datasets.
  • Out-of-Distribution Detection: Added GEN algorithm which can detect out-of-distribution data based on pred_probs more effectively in datasets with tons of classes.
  • Speed and Efficiency: A 50x speedup has been achieved in the cleanlab.multiannotator code to enable quicker analysis of data labeled by multiple annotators. The memory requirements for methods across the package have been greatly reduced via algorithmic tricks as well.

Many additional improvements were made in this update, check out the release notes for a complete list.

Easy mode: Cleanlab Studio

While our open-source library finds issues in your dataset via your own ML model, its utility depends on you having a proper ML model and efficient interface to fix these data issues. Providing all of these pieces, Cleanlab Studio is a no-code platform to find and fix problems in real-world datasets. Beyond data curation & correction, Cleanlab Studio additionally automates almost all of the hard parts of turning raw data into reliable ML or Analytics:

  1. Labeling (and re-labeling) data
  2. Training a baseline ML model
  3. Diagnosing and correcting data issues via an efficient interface
  4. Identifying the best ML model for your data and training it
  5. Deploying the model to serve predictions in a business application

All of these steps are easily completed with a few clicks in the Cleanlab Studio application for text, tabular and image datasets (depicted below). Behind the scenes is an AI system that automates all of these steps wherever it confidently can.

Try Cleanlab Studio for free!

A Big Thank You to Contributors

Transforming cleanlab into the most popular data-centric AI package today has been no small feat. A heartfelt appreciation goes out to all contributors. Your relentless hard work and dedication are what makes Cleanlab shine, and directly helps the thousands of data/ML scientists using this package.

An especially big thank you to those who made their first code contributions in this release: Vedang Lad, Gordon Lim, Angela Liu, Chenhe Gu, Steven Yiran, Tata Ganesh (who also contributed to our blog).

If you feel there’s another ML task that Cleanlab should support or have other suggestions, your feedback is invaluable. We are always looking for more code contributors to help us build the future of open-source Data-Centric AI, you can easily get started via our contributing guide!

Next Steps

Quickly improve your own Data/AI with Cleanlab

Join our community of data-centric scientists and engineers

Resources to learn more

Let’s make AI more data-centric with Cleanlab v2.5 and beyond!

Related Blogs
Automatically Find and Fix Issues in Image/Document Tags and other Multi-Label Datasets
In this tutorial, learn how to use Cleanlab Studio to automatically correct multi-label classification data for image and document tagging, content curation, NLP, and more!
Read morearrow
CROWDLAB: Simple and effective algorithms to handle data labeled by multiple annotators
Understanding cleanlab's new methods for multi-annotator data and what makes them effective.
Read morearrow
Training Transformer Networks in Scikit-Learn?!
Learn how to easily make any Tensorflow/Keras model compatible with scikit-learn.
Read morearrow
Get started today
Try Cleanlab Studio for free and automatically improve your dataset — no code required.
More resourcesarrow
Explore applications of Cleanlab Studio via blogs, tutorials, videos, and read the research that powers this next-generation platform.
Join us on Slackarrow
Join the Cleanlab Community to ask questions and see how scientists and engineers are practicing Data-Centric AI.