Learn more about Data-Centric AI

Data-Centric AI is the systematic engineering of better data (via AI and automation). Learn about key concepts, useful tricks, and helpful tools.

CROWDLAB: The Right Way to Combine Humans and AI for LLM Evaluation

CROWDLAB improves your team's LLM Evals process by automatically producing reliable ratings and flagging which outputs need further review.

How to detect bad data in your instruction tuning dataset (for better LLM fine-tuning)

Overview of automated tools for catching low-quality responses, incomplete/vague prompts, and other problematic text (toxic language, PII, informal writing, bad grammar/spelling) lurking in an instruction-response dataset. Here we reveal findings for the Dolly dataset.

Automatically Detect Problematic Content in any Text Dataset

Introducing AI text audits for automated content moderation and curation, including detection of toxic, non-English, and informal language, as well as personally identifiable information.

Improving any OpenAI Language Model by Systematically Improving its Data

Reduce LLM prediction error by 37% via data-centric AI.

ActiveLab: Active Learning with Data Re-Labeling

ActiveLab helps you optimally choose which data to (re)label, lowering the cost to train an accurate ML model.

CROWDLAB: Simple and effective algorithms to handle data labeled by multiple annotators

Understanding cleanlab's new methods for multi-annotator data and what makes them effective.

Automatically catching spurious correlations in ML datasets

An open-source module to detect spurious correlations between dataset labels and features that will not generalize to real-world deployment.

Announcing Auto-Labeling Agent: Your Assistant for Rapid and High Quality Labeling

Generate AI, not headaches. Automate annotation with AI.

An open-source platform to catch all sorts of issues in all sorts of datasets

With cleanlab v2.6, the most popular library for Data-Centric AI now offers more comprehensive data audits including new checks for underperforming groups, null values, imbalanced classes, and more.

Comparing tools for Data Science, Data Quality, Data Annotation, and AI/ML

What's the next-generation platform for Data Science? A data-centric AI system that can automatically find and fix data issues, label data, and train/deploy reliable models.

How to Filter Unsafe and Low-Quality Images from any Dataset: A Product Catalog Case Study

Introducing an automated solution to ensure high-quality image data, for both content moderation and boosting engagement. Easily curate any product/content catalog or photo gallery to delight your customers.

Detecting Annotation Errors in Semantic Segmentation Data

Introducing new methods for estimating labeling quality in image segmentation datasets.

Read more blogs.

Learn more from the first-ever course on Data-Centric AI taught at MIT by the Cleanlab team and made freely available.