Whisking Away Errors: How Cleanlab Studio Served Up Fixes for the Food-101N Computer Vision Dataset

September 11, 2023
  • Chris Mauck

CSA? Like a PSA, the CSA (Cleanlab Studio Audit) is our way of informing the community about issues in popular datasets. To glean insights about popular datasets, we quickly run them through Cleanlab Studio — an automated solution to find and fix data issues using AI.

This thumbnail provides an overview of all of the issues that Cleanlab Studio found in the Food-101N dataset.

Here we consider the Food-101N dataset: a variant of the Food-101 dataset that has been used in over 400 publications. This dataset contains 101k images with 101 food categories such as carrot_cake, ice_cream, and waffles. We discovered thousands of label issues, outliers, ambiguous examples, and (near) duplicates in this famous computer vision dataset just by quickly running it through Cleanlab Studio.

This image shows an example of a label error with a piece of carrot cake labeled as cheesecake.

Above we see one example of an image that Cleanlab Studio automatically identified as mislabeled (while also suggesting a more appropriate label to use instead). This food item was labeled as a piece of cheesecake when it is clearly a piece of carrot cake. The video below displays the top detected label issues in Cleanlab Studio’s web interface. You can see that the Given label in the original data is incorrect for most of these images, and that Cleanlab Studio has automatically Suggested a more appropriate label for each.

The authors of this dataset noted that “[Food-101N]…has [many] more images and is more noisy…” than the original Food-101 dataset. While this disclaimer is a step in the right direction, Cleanlab Studio provides you with certainty regarding your dataset’s quality and equips you with the tools to address these issues if necessary. Interestingly, the Food-101N dataset authors did not mention outliers, ambiguous instances, or duplicate examples in their disclaimer, all of which, as we will later demonstrate, are noticeably prevalent in the dataset.

27,488 Mislabeled Examples Found

Cleanlab Studio automatically found thousands of examples that were incorrectly labeled and suggested the correct labels. Below we show examples of food items that are labeled incorrectly.

This image shows an example of many label errors found with Cleanlab Studio.
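
For readers curious how a Given label can be compared against a Suggested one, here is a minimal sketch. It is not Cleanlab Studio's actual algorithm: it is a simplified, hypothetical take on the general idea of flagging an example when a classifier is confident in a class other than the given label. The function name `flag_label_issues`, the per-class mean threshold, and the toy probabilities are all assumptions made for illustration.

```python
from collections import defaultdict


def flag_label_issues(given_labels, pred_probs):
    """Flag examples whose given label disagrees with a confident prediction.

    given_labels: list of int class indices (the labels stored in the dataset).
    pred_probs:   list of per-class probability lists, ideally predicted
                  out-of-sample (e.g. via cross-validation).
    Returns a list of (index, given_label, suggested_label) tuples.
    """
    # Per-class threshold (an assumption of this sketch): the mean predicted
    # probability of class k over the examples that are labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, probs in zip(given_labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    issues = []
    for i, (label, probs) in enumerate(zip(given_labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in thresholds if probs[k] >= thresholds[k]]
        if not confident:
            continue  # no confident class, so nothing to suggest
        suggested = max(confident, key=lambda k: probs[k])
        if suggested != label:
            issues.append((i, label, suggested))
    return issues


# Toy data: two classes (0 = cheesecake, 1 = carrot_cake), six images.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
         [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
issues = flag_label_issues(labels, probs)
print(issues)  # [(2, 0, 1)]
```

In the toy data, the third image is labeled cheesecake but the model confidently predicts carrot_cake, so it is flagged with carrot_cake as the suggested label.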

8,519 Outliers Found

Sometimes images should be removed from the dataset entirely as they do not belong to any of the target classes. Below are some examples Cleanlab Studio automatically detected as outliers that cannot be correctly labeled as any of the classes and should be removed from the dataset entirely.

This image shows an example of many outliers found with Cleanlab Studio.

13,538 Ambiguous Examples Found

Cleanlab Studio also found many images that are ambiguous. These are images where the true label is unclear as there could be more than one correct label or no label perfectly applies. You can see below two examples of ambiguous images.

This image shows an example of many ambiguous examples found with Cleanlab Studio.
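
One rough way to think about ambiguity is that a classifier cannot separate its top two candidate classes. The sketch below illustrates that idea only. It is not how Cleanlab Studio scores ambiguity, and the function name `flag_ambiguous` and the `margin` value of 0.1 are hypothetical choices.

```python
def flag_ambiguous(pred_probs, margin=0.1):
    """Return indices of examples whose top two class probabilities are close.

    pred_probs: list of per-class probability lists (at least two classes).
    margin:     flag an example when (top - runner_up) is below this value.
    """
    flagged = []
    for i, probs in enumerate(pred_probs):
        top, runner_up = sorted(probs, reverse=True)[:2]
        if top - runner_up < margin:
            flagged.append(i)
    return flagged


# Toy data: three images, three classes.
toy_probs = [
    [0.50, 0.45, 0.05],  # two plausible labels
    [0.90, 0.05, 0.05],  # one clear label
    [0.34, 0.33, 0.33],  # no label clearly applies
]
ambiguous = flag_ambiguous(toy_probs)
print(ambiguous)  # [0, 2]
```

A small margin covers both cases described above: an image with two plausible labels and an image where no label clearly applies.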

17,510 (Near) Duplicate Examples Found

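Cleanlab Studio found many duplicate images, including cases where one image was labeled correctly and its duplicate was not. Duplicate images with conflicting labels give a model contradictory training signal, and duplicates that land in both the training and test splits can inflate evaluation scores.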

This image shows an example of many (near) duplicate examples found with Cleanlab Studio.
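
As a minimal sketch of the simplest version of this check, the snippet below groups byte-identical images by content hash and reports groups whose labels disagree. It only catches exact duplicates; near duplicates (resized, cropped, or re-encoded copies) would need something like perceptual hashing or embedding similarity, and this is not Cleanlab Studio's method. The function name `conflicting_duplicates` and the toy file names are hypothetical.

```python
import hashlib
from collections import defaultdict


def conflicting_duplicates(examples):
    """Find exact-duplicate images that carry different labels.

    examples: iterable of (name, image_bytes, label) tuples.
    Returns a list of groups; each group is a list of (name, label) pairs
    that share identical bytes but do not all share the same label.
    """
    groups = defaultdict(list)
    for name, data, label in examples:
        digest = hashlib.sha256(data).hexdigest()  # content fingerprint
        groups[digest].append((name, label))

    conflicts = []
    for members in groups.values():
        distinct_labels = {label for _, label in members}
        if len(members) > 1 and len(distinct_labels) > 1:
            conflicts.append(members)
    return conflicts


# Toy data: raw bytes stand in for real image files.
toy_examples = [
    ("a.jpg", b"\x01\x02", "waffles"),
    ("b.jpg", b"\x01\x02", "pancakes"),   # same bytes as a.jpg, other label
    ("c.jpg", b"\x03", "waffles"),
    ("d.jpg", b"\x04", "ice_cream"),
    ("e.jpg", b"\x04", "ice_cream"),      # duplicate, but labels agree
]
conflicts = conflicting_duplicates(toy_examples)
print(conflicts)  # [[('a.jpg', 'waffles'), ('b.jpg', 'pancakes')]]
```

For real files, the bytes could be read with `open(path, "rb").read()` before being passed in.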

Fix data with Cleanlab Studio

This image shows the Cleanlab Studio web interface.

These data errors detected by the AI in Cleanlab Studio are detrimental to your modeling and analytics efforts. It’s important to know about such errors in your data and correct them in order to train the best models and draw the most accurate conclusions. To find and fix such issues in almost any dataset (text, image, table/CSV/Excel, etc.), just run it through Cleanlab Studio!

If you’ve found interesting issues in any dataset, your findings can be featured in future CSAs if you want to share them with the community! Just fill out this form.
