Using Cleanlab Studio to Audit Public Datasets

The Cleanlab Studio Audit (CSA) uses AI to auto-detect problems in a given dataset. Like a PSA, the CSA is a recurring series that informs the community about issues in popular datasets, all found and corrected automatically with Cleanlab Studio.

Cleanlab Studio can just as easily help you improve your own image, text, or tabular/CSV/Excel dataset. Try it now!

If you find interesting issues in any dataset, they can be featured here! Just fill out this form.

The Stanford Cars Dataset aka Cars196 (cited in 1000+ papers) contains many Fine-Grained Errors

05/24/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in a given dataset. Here we report issues found in the Stanford Cars196 image classification dataset, which can impair product categorization, product identification, and other business intelligence efforts.

  • Chris Mauck

The Office-Home Dataset (cited by 600+ papers) contains hundreds of incorrect labels and outliers.

04/21/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in given data -- here we report findings for the Office-Home image classification dataset.

  • Chris Mauck
  • Jonas Mueller

Use Cleanlab to Improve LLMs: Find Errors in Human Feedback in the Anthropic RLHF Dataset

04/11/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in given data -- here we report findings for a popular Reinforcement Learning from Human Feedback dataset.

  • Chris Mauck
  • Jonas Mueller