An open-source platform to catch all sorts of issues in all sorts of datasets

February 21, 2024
  • Elías Snorrason
  • Jonas Mueller

We open-sourced cleanlab as a Python library to quickly identify dataset problems in any Machine Learning project. While manual issue detection is often done during data prep prior to model training, your trained ML model captures a lot of information about its dataset that can reveal critical issues if the right algorithms are applied. The cleanlab package offers a data-centric AI platform to run many such algorithms and detect common problems in ML datasets, such as mislabeling, outliers, (near) duplicates, and drift.

Today’s release of cleanlab v2.6 greatly expands the capabilities and usefulness of this library. It’s been a long time coming, and we hope you’ll use cleanlab to quickly audit your dataset in all ML projects. Don’t do all your data checking manually – also use our automated ML algorithms to ensure you don’t miss any problems!

What’s new in v2.6

More comprehensive issue detection in Datalab

cleanlab’s Datalab platform offers a unified audit that simultaneously runs many of our data-quality algorithms on your dataset and labels to catch different types of issues. With v2.6, Datalab can detect a broader set of issues automatically, providing a more thorough analysis of your datasets. Some of the bigger additions include:

  • Flag Null Values: Datalab now automatically points out any missing values in your dataset (particularly flagging rows with entirely missing observations) which are a common source of trouble for ML models.
  • Be aware of Imbalanced Classes: Datalab now automatically alerts you if your classification dataset is imbalanced, so you can pay special attention to minority classes if appropriate.
  • Discover Underperforming Groups: Datalab now automatically identifies subgroups of data that your ML model is struggling to predict accurately. These may correspond to underrepresented groups in the dataset, or subpopulations not well-represented by given feature values. These underperforming groups are detected by clustering the data and checking for clusters in which model predictions tend to be unusually poor. You can alternatively specify pre-determined clusters by following our guide, for instance based on meaningful data slices (such as a categorical column that defines natural subpopulations).
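To make the underperforming-group idea concrete, here is a minimal sketch of the check with pre-determined clusters: compare model accuracy within each cluster to accuracy overall. This is an illustration only, not Datalab's implementation; the function name, the `min_size` parameter, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def find_underperforming_group(cluster_ids, given_labels, predicted_labels, min_size=2):
    """Return (cluster, cluster_accuracy, overall_accuracy) for the least accurate cluster.

    Hypothetical helper for illustration; not part of cleanlab.
    """
    correct_by_cluster = defaultdict(list)
    for cluster, given, predicted in zip(cluster_ids, given_labels, predicted_labels):
        correct_by_cluster[cluster].append(given == predicted)

    overall = sum(g == p for g, p in zip(given_labels, predicted_labels)) / len(given_labels)
    accuracy = {
        cluster: sum(hits) / len(hits)
        for cluster, hits in correct_by_cluster.items()
        if len(hits) >= min_size  # skip clusters too small to judge
    }
    worst = min(accuracy, key=accuracy.get)
    return worst, accuracy[worst], overall


# Toy data: a made-up categorical column defines the slices.
cluster_ids = ["web", "web", "web", "app", "app", "app", "kiosk", "kiosk"]
given_labels = [1, 0, 1, 0, 1, 1, 0, 1]
predicted_labels = [1, 0, 1, 0, 1, 0, 1, 0]

worst, worst_accuracy, overall_accuracy = find_underperforming_group(
    cluster_ids, given_labels, predicted_labels
)
print(worst, worst_accuracy, overall_accuracy)
```

A cluster whose accuracy sits far below the overall accuracy is a candidate for closer inspection, for example to check whether it is underrepresented or mislabeled.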

Data Valuation. Along with these new audits for dataset problems, Datalab can now optionally run data valuation to identify which data points contribute the most/least to ML model performance. This is achieved via a K Nearest Neighbors approximation of the Data Shapley value. Here’s how to estimate the Data Shapley value of each data point in your classification dataset based on provided vector features (could be model embeddings of the data):

from cleanlab import Datalab

lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(features=features, issue_types={"data_valuation": {}})
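For intuition about what a K Nearest Neighbors approach to the Data Shapley value computes, the sketch below applies the closed-form recursion known from the KNN-Shapley literature to a single test point. It is a simplified illustration under our own assumptions (unweighted K-NN, Euclidean distance, one test point) and not necessarily how Datalab computes its scores; the function name and toy data are hypothetical.

```python
import math


def knn_shapley_for_one_test_point(train_features, train_labels, test_feature, test_label, k=1):
    """Shapley value of each training point for an unweighted K-NN classifier,
    where utility is the fraction of the K nearest neighbors matching the test label.

    Hypothetical helper for illustration; not part of cleanlab.
    """
    n = len(train_features)
    # Rank training points from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: math.dist(train_features[i], test_feature))
    match = [1.0 if train_labels[i] == test_label else 0.0 for i in order]

    ranked_values = [0.0] * n
    ranked_values[n - 1] = match[n - 1] / n  # base case: the farthest point
    # Walk inward from the second-farthest point to the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-indexed rank of this point
        ranked_values[pos] = ranked_values[pos + 1] + (
            (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )

    # Map values back to the original ordering of the training points.
    values = [0.0] * n
    for pos, i in enumerate(order):
        values[i] = ranked_values[pos]
    return values


train_features = [(2.0,), (0.0,), (1.0,)]
train_labels = ["a", "a", "b"]
values = knn_shapley_for_one_test_point(train_features, train_labels, (0.1,), "a", k=1)
print(values)
```

In this toy example the nearest point with a matching label receives the largest value, while the mismatching point receives a negative value. In practice such per-test-point values would be averaged over many test points.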

Unlike other data quality software, you don’t have to write manual rules to detect these issues with cleanlab. Datalab also automatically detects many other issue types (mislabeling, outliers, near duplicates, subtle drift, etc.) that rule-based checks could not catch anyway, because detecting them requires understanding the information in each data point, which only ML can currently provide.

Multiple ML tasks supported in Datalab

With v2.6.0, the Datalab platform now supports several ML tasks, including:

  • Classification (the default task, tutorial)
  • Regression, i.e. numeric-valued labels (tutorial)
  • Multi-label classification, e.g. document/image tagging (tutorial)

Specifically, the Datalab API now offers a task parameter, which can be set to the ML task you are working on (typically corresponding to how your data is labeled). A workflow for using Datalab for a regression dataset might look like this:

from sklearn.ensemble import HistGradientBoostingRegressor

# Optionally fit any ML model to the data and make predictions:
features = dataset[feature_columns]
model = HistGradientBoostingRegressor().fit(features, dataset["numeric_label_column"])
predictions = model.predict(features)  # use cross-validation in practice to instead have held-out predictions

from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="numeric_label_column", task="regression")
lab.find_issues(features=features, pred_probs=predictions)

lab.report()  # summarize various issues detected in the dataset
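The comment in the snippet above recommends held-out predictions. The sketch below shows the idea with plain K-fold cross-validation: every example is predicted by a model that never saw it during training. The one-feature least-squares fit is only a stand-in for a real regressor, and both helper functions are hypothetical; in practice a library routine such as scikit-learn's cross_val_predict can do this for you.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in for any regressor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def held_out_predictions(xs, ys, n_folds=5):
    """Predict each example using a model trained only on the other folds."""
    n = len(xs)
    predictions = [None] * n
    for fold in range(n_folds):
        test_idx = list(range(fold, n, n_folds))  # every n_folds-th example
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = slope * xs[i] + intercept
    return predictions


# Toy data: a clean linear relationship with one corrupted label.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[4] = 40.0

predictions = held_out_predictions(xs, ys)
residuals = [abs(y - p) for y, p in zip(ys, predictions)]
suspect = max(range(len(ys)), key=residuals.__getitem__)
print(suspect, predictions[suspect])
```

Because the corrupted example is excluded from the model that predicts it, its prediction stays close to the true trend and its residual stands out, which is why held-out predictions tend to make label issues easier to spot.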

Note that the cleanlab package supports many other ML tasks, and you can still achieve a comprehensive audit of almost any dataset even if the ML task is not supported by Datalab yet. In this case, simply run Datalab treating the data as unlabeled, and separately apply cleanlab’s task-specific label issue detection capabilities.

Exploratory data analysis for Object Detection

With this release, cleanlab introduces new functions that help you understand and visualize properties of object detection datasets, which may help you improve your dataset. This includes distributions of the:

  • number of annotated and model-predicted objects in each image
  • given class labels associated with bounding boxes
  • sizes of bounding boxes annotated for each class across the dataset.

Viewing these distributions (and the images at their tails) facilitates exploratory data analysis and understanding overall properties of the annotations in your object detection dataset. cleanlab’s object detection tutorial has been updated to showcase this exploratory data analysis.
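As a rough picture of the kind of summaries involved, the sketch below computes these three distributions by hand for a few toy annotations. The annotation layout here (one dict per image with `bboxes` as corner coordinates and `labels`) is an assumption made for this example, and the code does not use cleanlab's own functions.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical annotations: one dict per image, boxes as (x1, y1, x2, y2) pixel corners.
annotations = [
    {"bboxes": [(10, 10, 50, 60), (30, 40, 80, 90)], "labels": ["car", "person"]},
    {"bboxes": [(0, 0, 20, 20)], "labels": ["car"]},
    {"bboxes": [], "labels": []},
]

# How many images contain 0, 1, 2, ... annotated objects.
objects_per_image = Counter(len(image["labels"]) for image in annotations)

# How often each class label appears across all bounding boxes.
class_counts = Counter(label for image in annotations for label in image["labels"])

# Bounding-box areas grouped by class, summarized by their median.
areas_by_class = defaultdict(list)
for image in annotations:
    for (x1, y1, x2, y2), label in zip(image["bboxes"], image["labels"]):
        areas_by_class[label].append((x2 - x1) * (y2 - y1))
median_area_by_class = {label: median(areas) for label, areas in areas_by_class.items()}

print(objects_per_image, class_counts, median_area_by_class)
```

Images at the extremes of these summaries (no objects at all, or unusually small or large boxes for a class) are natural candidates to review first.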

Other enhancements

We made tons of other enhancements in cleanlab v2.6, such as: better scaling of outlier and duplicate issue scores, more efficient label issue detection across all ML tasks, better performance in binary classification tasks, …

See the full list of changes in the v2.6 release notes.

A growing community

Thanks to both new and longtime contributors, our Cleanlab community is stronger than ever. Your hard work and creative ideas really make a difference, and we’re thrilled to see how our collaborations have made cleanlab into the most popular software for Data-Centric AI.

We particularly thank the following contributors who made their first code contributions in this release: Samet Taspinar, Abhijit Pal, Pratham Savaliya, OrdoAbChao, Gibson Han, Kyle Gallatin, Ryan Singman, and Reuven Peleg.

We’re always looking for more contributors to help us build the future of open-source Data-Centric AI. Whether it’s a bug report, feature request, or a pull request that you’d like to submit, we’d love to hear from you! To get started, check out our contributing guide.

With help from the community, we are continuously improving cleanlab and adding new capabilities to use AI to help you improve your existing datasets. We plan to introduce a more frequent release schedule to ensure you always have access to the latest enhancements.

With cleanlab, we strive to empower data scientists, AI researchers, and engineers with a free/transparent tool to streamline the process of ensuring high-quality data for reliable machine learning. Let’s make Data-Centric AI more useful and accessible to all with cleanlab v2.6 and beyond!

Next Steps

Join our Slack community of data-centric scientists/engineers + follow on Twitter & LinkedIn.
