Company updates, tutorials, research, and more!
New algorithms to identify values in a numerical data column that are likely incorrect (e.g. due to noise from erroneous sensors, data entry/processing mistakes, imperfect human estimates).
Introducing cleanlab v2.5, the long-awaited release that adds support for practicing Data-Centric AI in the ML tasks most requested by users.
Learn data-centric techniques for better few-shot prompting when applying LLMs to noisy real-world data.
Use AI software to automatically identify mis-categorized legal documents and provide more accurate relevance determination.
Data is the fuel for AI (and Analytics), but it is messy in real enterprise applications. Here’s how to use AI to refine it as well, allowing your company to build a Data Engine as powerful as those at the heart of today’s biggest tech companies.
A fully-automated analysis of errors in the ImageNet training set.
Introducing an entirely automated solution to: train cutting-edge ML models on raw data, use these models to detect various issues in the data, correct these issues, train better models on the improved data, and deploy them to serve reliable predictions in applications.
Cleanlab Studio for Enterprise launches to automate data curation for LLMs and the modern AI stack with $5 million in seed funding from Bain Capital Ventures.
Use AI to measure the quality of LLM-generated data, automatically detecting unrealistic synthetic examples and underrepresented tails of the real data distribution.
Using AI to analyze product listings for errors, and how this boosts the accuracy of product categorization and analytics efforts.
You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test data.
Reduce LLM prediction error by 37% via data-centric AI.
ActiveLab helps you optimally choose which data to (re)label, lowering the cost to train an accurate ML model.
The Cleanlab Studio Audit uses AI to auto-detect problems in given data. Explore all sorts of issues found in popular datasets by Cleanlab Studio!
Use AI to measure the quality of satellite imagery data, automatically detecting mislabeled examples, outliers, ambiguous examples, and (near) duplicate examples.
How automated quality assurance can help data annotation teams ensure accurate data with less work.
A legal sector case study using Cleanlab Studio to produce better models for making predictions (e.g. of final judgements) based on court case documents.
A simple method to determine if a dataset violates the IID assumption in common ways (e.g. temporal drift, or interaction between almost adjacent datapoints).
Use ActiveLab to efficiently choose which data to (re)label to train the best Transformer model.
Catch issues in your data/labels. This unified audit uses your ML model to automatically detect various problems in real-world datasets that can be fixed to produce a better model.
Introducing an open-source Python package to automatically identify common issues in image datasets.
Learn how to easily make any TensorFlow/Keras model compatible with scikit-learn.
Highlighting what's new in cleanlab 2.3.
Learn how to reduce prediction errors by 70% using data-centric techniques with cleanlab.
Introducing new data quality algorithms for multi-label classification in cleanlab v2.2.
Introducing cleanlab's two new methods to detect outliers and how they perform on real image data.
Exploring new ways to identify outliers based on probabilistic predictions from a trained classifier.
Understanding cleanlab's new methods for text-based token classification tasks.
Understanding cleanlab's new methods for multi-annotator data and what makes them effective.
Highlighting new features available in cleanlab 2.1.
How we built an in-browser visualization of Cleanlab's Confident Learning algorithm.
Learn how to find label issues in text datasets and improve NLP models.
Learn how to find label issues in any audio classification dataset.
Learn how to automatically find label issues in any image classification dataset.
Announcing cleanlab 2.0: an open-source framework for machine learning and analytics with messy, real-world data.
How an MIT grad student project became a company with tech used by Google, Amazon, Tesla, Uber, Facebook, and companies around the world.