Automated Correction of Satellite Imagery Data

09/20/2023
  • Chris Mauck
  • Aditya Thyagarajan

Real-world satellite image datasets are messy and full of issues like labeling errors, duplicated data, outliers, and other ambiguities. Addressing these data issues allows you to produce more reliable ML models and analytics, but can be laborious. This article demonstrates how you can use automated data correction to improve the quality of satellite imagery data such as the RESISC45 dataset – with only a few clicks!

Errors in satellite image datasets can skew scientific research, lead to misguided policies, and result in financial losses for industries like agriculture or urban planning. In critical situations, such as disaster response, inaccurate data can hinder relief efforts and endanger lives. This blog delves into how to automatically find and fix issues in satellite image datasets using Cleanlab Studio, a no-code platform that allows you to improve the quality of your image/text/tabular datasets (and machine learning models) faster than any other tool. The secret? Novel Data-Centric AI methods invented by our scientists that improve the data itself.

Overview

As an example dataset, this article considers the RESISC45 remote sensing dataset, a rich compilation of more than 30,000 images representing 45 scene categories. With over 1,400 Google Scholar citations, it’s celebrated for its diversity and breadth, yet no dataset is entirely devoid of errors. Within moments of auditing this dataset, Cleanlab Studio automatically detected:

  • 281 label issues
  • 363 outliers
  • 20 near duplicates

Mislabeled Examples

Below are a few examples of satellite images that Cleanlab Studio automatically detected as mislabeled. In the third example, the image does contain a runway, but in the context of the rest of the dataset the correct label is airport: the image also shows buildings, airplanes, service roads, and other features that make it an airport rather than just a runway, and each image can carry only one label. If this dataset had been designed as multi-label (explained later), the image could be labeled as both airport and runway. This kind of human error is hard to catch by hand, yet Cleanlab Studio detects it easily.
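
To give a feel for how mislabeled images can be surfaced automatically, here is a minimal Python sketch of one simple rule in the spirit of confident learning: flag an image when a classifier's predicted probability for its given label falls below that class's average self-confidence while some other class clears its own threshold. This is an illustration under stated assumptions, not Cleanlab Studio's actual implementation; the function name `flag_label_issues`, the class names, and the toy probabilities are all hypothetical.

```python
from collections import defaultdict


def flag_label_issues(labels, pred_probs):
    """Return indices whose given label looks wrong under a simple
    confident-learning-style rule (illustrative only).

    labels:     list of given class names, one per example
    pred_probs: list of dicts mapping class name -> predicted probability
    """
    # Per-class threshold: mean predicted probability of class k
    # over the examples that were labeled k.
    sums, counts = defaultdict(float), defaultdict(int)
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    issues = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Classes other than the given one that the model is confident about.
        confident_others = [
            k for k, t in thresholds.items()
            if k != label and probs[k] >= t
        ]
        if probs[label] < thresholds[label] and confident_others:
            issues.append(i)
    return issues


# Hypothetical toy predictions for two classes.
labels = ["airport", "airport", "runway", "runway", "runway"]
pred_probs = [
    {"airport": 0.9, "runway": 0.1},
    {"airport": 0.8, "runway": 0.2},
    {"airport": 0.1, "runway": 0.9},
    {"airport": 0.2, "runway": 0.8},
    {"airport": 0.95, "runway": 0.05},  # labeled runway, looks like airport
]
flagged = flag_label_issues(labels, pred_probs)
```

In this toy data only the last image is flagged: the model gives its "runway" label very low probability and is confident it is an airport. In practice the probabilities would come from a model evaluated on held-out data, so that it has not simply memorized the given labels.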

Outliers

Below are a few examples of satellite images that Cleanlab Studio automatically detected as out-of-distribution (outliers). Although these appear to be satellite images, none of them really belongs to any of the classes in the dataset.
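
One common heuristic for spotting outliers like these is to embed each image as a vector and score it by how far it sits from its nearest neighbors. The Python sketch below illustrates that idea on made-up 2-D points; it is not a description of how Cleanlab Studio computes outliers, and the function name `knn_outlier_scores` and the toy embeddings are assumptions for the example.

```python
import math


def knn_outlier_scores(embeddings, k=2):
    """Score each point by its mean distance to its k nearest neighbors.
    Larger scores suggest the point is farther from the rest of the data."""
    scores = []
    for i, x in enumerate(embeddings):
        # Distances from point i to every other point, smallest first.
        dists = sorted(
            math.dist(x, y) for j, y in enumerate(embeddings) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical 2-D embeddings: four clustered points and one far away.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(embeddings, k=2)
most_atypical = max(range(len(scores)), key=scores.__getitem__)
```

Here the point at (5.0, 5.0) receives by far the largest score. With real images, the embeddings would typically come from a pretrained vision model and have hundreds of dimensions.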

Duplicate Examples

Below are some examples of satellite images that Cleanlab Studio automatically detected as duplicates. Duplicate examples can be detrimental to model evaluation if the same data points exist in both the training and evaluation data.
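
Once near-duplicate groups are known, checking whether any of them straddles the training and evaluation data takes only a few lines. The Python sketch below assumes you already have the duplicate groups and a mapping from image ID to split; the image IDs, the split names, and the function name `find_leaking_groups` are hypothetical.

```python
def find_leaking_groups(duplicate_groups, split_of):
    """Return duplicate groups whose members span more than one split."""
    leaking = []
    for group in duplicate_groups:
        splits = {split_of[image_id] for image_id in group}
        if len(splits) > 1:
            leaking.append(sorted(group))
    return leaking


# Hypothetical image IDs, near-duplicate groups, and split assignment.
duplicate_groups = [{"img_001", "img_417"}, {"img_020", "img_021"}]
split_of = {
    "img_001": "train",
    "img_417": "test",
    "img_020": "train",
    "img_021": "train",
}
leaks = find_leaking_groups(duplicate_groups, split_of)
```

Only the first group is reported, because one copy is in the training split and the other in the test split. A simple remedy is to keep every member of a duplicate group in the same split, or to drop all but one copy.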

Multi-class or multi-label?

One arguable takeaway from these findings is that this dataset could have been designed as a multi-label dataset instead of a multi-class dataset.

Multi-class classification: A single image belongs to exactly one of the classes – the classes are mutually exclusive.

Multi-label classification: A single image can belong to one or more classes simultaneously or none of the classes at all – the classes are not mutually exclusive (each class either applies to the image or not).
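
The difference shows up directly in how targets are encoded. The short Python sketch below contrasts the two encodings; the three-class list and the helper names `to_one_hot` and `to_multi_hot` are assumptions chosen for illustration.

```python
# Illustrative subset of classes, not the full dataset label set.
CLASSES = ["airport", "runway", "harbor"]


def to_one_hot(label):
    """Multi-class: exactly one class is active."""
    return [1 if c == label else 0 for c in CLASSES]


def to_multi_hot(labels):
    """Multi-label: any subset of classes (including none) may be active."""
    return [1 if c in labels else 0 for c in CLASSES]


multi_class_target = to_one_hot("airport")                # [1, 0, 0]
multi_label_target = to_multi_hot({"airport", "runway"})  # [1, 1, 0]
no_label_target = to_multi_hot(set())                     # [0, 0, 0]
```

A multi-class target always contains exactly one 1, while a multi-label target can contain several or none, which is how an image showing both an airport and a runway could carry both labels.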

Multi-label datasets provide more information per example, which means ML engineers and data scientists can build more complex and powerful models. In our case, Cleanlab Studio automatically detected that many images in this dataset have more than one applicable label (even though each image was originally annotated with only a single label).

Conclusion

Cleanlab Studio is a powerful and efficient tool for identifying and addressing many types of issues in satellite imagery data. The examples here show a wide variety of issues that plague datasets and can have harmful downstream consequences, especially with satellite imagery. Cleanlab Studio allows you to automatically find and fix these issues long before they harm your modeling or analytics.

Next Steps

  • Automatically find and fix issues in your own satellite imagery data with Cleanlab Studio!
  • Join our Community Slack to discuss ways to practice data-centric computer vision.
  • Follow us on LinkedIn or Twitter for updates on new techniques to ensure the highest quality data.