The Office-Home Dataset (cited by 600+ papers) contains hundreds of incorrect labels and outliers.

April 21, 2023
  • Chris Mauck
  • Jonas Mueller

Surprise! The Office-Home Dataset (cited by 600+ papers) contains hundreds of label errors and other data issues, which we found using Cleanlab Studio, an automated solution that uses AI to find and fix data issues. This CSA (Cleanlab Studio Audit) is our way to inform the community about issues in popular datasets.

Image Classification with the Office-Home Dataset

Here we consider the Office-Home Dataset, which has been cited by over 600 research papers in recent years. This dataset contains labeled images of 65 types of items commonly found around the home or office, such as lamps, computers, and alarm clocks. We discovered hundreds of issues in this famous computer vision dataset just by quickly running it through Cleanlab Studio.

Label errors found within the laptop and computer classes.

Above, the first two images are correctly labeled examples (randomly chosen from the dataset), while the latter two are images that Cleanlab Studio automatically identified as mislabeled (it also suggested a more appropriate label to use instead).

According to the dataset's authors, the Office-Home dataset was “…collected using a python web-crawler…then filtered to ensure that the desired object was in the picture”. This common method of dataset curation often produces incorrect image-label pairs. Let’s look at more problems Cleanlab Studio detected in this dataset.

Mislabeled Examples

Cleanlab Studio found hundreds of examples that are incorrectly labeled. Here we show two groups of commonly confused classes: computers ←→ laptops and chairs ←→ couches. As above, the first two images in each group are correctly labeled examples (randomly chosen from the dataset), while the following two are images that Cleanlab Studio automatically identified as mislabeled. The distinction between these pairs of classes should clearly be better defined!
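To give a feel for how a mislabeled image can be flagged together with a suggested replacement label, here is a minimal sketch in plain Python. It is not Cleanlab Studio's actual algorithm. It assumes you already have out-of-sample predicted class probabilities from some classifier; the function name `flag_likely_label_errors`, the toy probabilities, and the thresholding rule (per-class average self-confidence, loosely in the spirit of confident learning) are all illustrative assumptions.

```python
from collections import defaultdict

def flag_likely_label_errors(labels, pred_probs):
    """Return (index, given_label, suggested_label) for suspicious examples.

    labels: list of given class names, one per example.
    pred_probs: list of dicts mapping class name -> predicted probability
                (assumed to come from a model evaluated out-of-sample).
    """
    # Per-class threshold: the average probability the model assigns to a
    # class over the examples whose given label is that class.
    sums, counts = defaultdict(float), defaultdict(int)
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    thresholds = {c: sums[c] / counts[c] for c in counts}

    flagged = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Other classes the model is at least as confident about as it
        # typically is for examples of that class.
        confident = [
            c for c in probs
            if c != label and c in thresholds and probs[c] >= thresholds[c]
        ]
        # Flag only if the given label also looks unusually unlikely.
        if confident and probs[label] < thresholds[label]:
            suggested = max(confident, key=lambda c: probs[c])
            flagged.append((i, label, suggested))
    return flagged

# Toy data, made up for illustration (not taken from Office-Home).
labels = ["laptop", "laptop", "computer", "computer", "laptop"]
pred_probs = [
    {"laptop": 0.90, "computer": 0.10},
    {"laptop": 0.80, "computer": 0.20},
    {"laptop": 0.10, "computer": 0.90},
    {"laptop": 0.20, "computer": 0.80},
    {"laptop": 0.05, "computer": 0.95},  # labeled laptop, looks like a computer
]
issues = flag_likely_label_errors(labels, pred_probs)
print(issues)
```

In this toy example only the last image is flagged, with "computer" as the suggested label. A real audit would still want a human to review each flagged image before changing its label.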

Ambiguous Examples

Cleanlab Studio also found many images for which multiple labels are appropriate, even though each image is given only one label in the original dataset. Here are a few examples that could reasonably be labeled with either of the depicted labels; in the original dataset, these images are given only the blue label.
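One simple way to surface candidates like these is to look for images where a classifier splits its probability between two classes. The sketch below is only an illustration of that idea, not how Cleanlab Studio does it; the function name `find_ambiguous_examples`, the `min_second_prob` cutoff of 0.35, and the toy probabilities are assumptions.

```python
def find_ambiguous_examples(pred_probs, min_second_prob=0.35):
    """Return (index, top_class, runner_up_class) for ambiguous examples.

    An example counts as ambiguous here when its second most likely class
    still receives at least `min_second_prob` probability.
    """
    ambiguous = []
    for i, probs in enumerate(pred_probs):
        # Rank classes from most to least likely.
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) < 2:
            continue
        top_class, runner_up = ranked[0][0], ranked[1][0]
        if ranked[1][1] >= min_second_prob:
            ambiguous.append((i, top_class, runner_up))
    return ambiguous

# Toy probabilities, made up for illustration.
pred_probs = [
    {"chair": 0.55, "couch": 0.42, "bed": 0.03},  # chair or couch?
    {"chair": 0.97, "couch": 0.02, "bed": 0.01},  # clearly a chair
    {"chair": 0.05, "couch": 0.50, "bed": 0.45},  # couch or bed?
]
candidates = find_ambiguous_examples(pred_probs)
print(candidates)
```

The cutoff is a judgment call: a lower value surfaces more candidates for review, at the cost of more false alarms.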

Outliers

Sometimes images aren’t just mislabeled; they should be removed from the dataset entirely because they do not belong to any of the classes of interest. In this dataset, Cleanlab Studio automatically detected many such outliers: images that cannot be appropriately labeled as any of the classes.
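A common generic approach to finding outliers like these is to score each image by how far its feature embedding sits from its nearest neighbors. The following is a minimal sketch of that idea and is not a description of Cleanlab Studio's internals. It assumes each image has already been turned into a numeric embedding by some pretrained model; the function name `knn_outlier_scores`, the choice of `k=2`, and the tiny 2-D toy embeddings are illustrative assumptions.

```python
import math

def knn_outlier_scores(embeddings, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    Larger scores mean the point is farther from the rest of the data.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to compare")
    scores = []
    for i, x in enumerate(embeddings):
        # Euclidean distances to every other point, nearest first.
        dists = sorted(
            math.dist(x, y) for j, y in enumerate(embeddings) if j != i
        )
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores

# Toy 2-D "embeddings", made up for illustration: four points form a
# tight cluster and the last one lies far away from it.
embeddings = [
    (0.0, 0.0),
    (0.1, 0.0),
    (0.0, 0.1),
    (0.1, 0.1),
    (5.0, 5.0),
]
scores = knn_outlier_scores(embeddings)
most_outlying = max(range(len(scores)), key=scores.__getitem__)
print(most_outlying, scores)
```

Images with the highest scores are the ones worth inspecting first; whether to remove them remains a human decision.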

Fix data with Cleanlab Studio

Clearly, these data errors detected by the AI in Cleanlab Studio could be detrimental to your modeling and analytics efforts. It’s important to know about such errors in your data and correct them, in order to train the best models and draw the most accurate conclusions.

To find and fix such issues in almost any dataset (text, image, table/CSV/Excel, etc.), just run it through Cleanlab Studio. Try this no-code Data-Centric AI tool for free!

If you’ve found interesting issues in any dataset, your findings can be featured in future CSAs if you want to share them with the community! Just fill out this form.
