Real-world satellite image datasets are messy and full of issues like labeling errors, duplicated data, outliers, and other ambiguities. Addressing these data issues allows you to produce more reliable ML models and analytics, but can be laborious. This article demonstrates how you can use automated data correction to improve the quality of satellite imagery data such as the RESISC45 dataset – with only a few clicks!
Errors in satellite image datasets can skew scientific research, lead to misguided policies, and result in financial losses for industries like agriculture or urban planning. In critical situations, such as disaster response, inaccurate data can hinder relief efforts and endanger lives. This blog delves into how to automatically find and fix issues in satellite image datasets using Cleanlab Studio, a no-code platform that allows you to improve the quality of your image/text/tabular datasets (and machine learning models) faster than any other tool. The secret? Novel Data-Centric AI methods invented by our scientists that improve the data itself.
Overview
As an example dataset, this article considers the RESISC45 remote sensing dataset, which is a rich compilation of 30,000 images representing 45 scene categories. With over 1,400 Google Scholar citations, it’s celebrated for its diversity and breadth, yet no dataset is entirely devoid of errors. Within moments of auditing this dataset, Cleanlab Studio automatically detected:
- 281 label issues
- 363 outliers
- 20 near duplicates
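To put those counts in perspective, here is a quick back-of-the-envelope calculation of how small a share of the 30,000 images each issue type represents. The counts are the ones reported above; the variable and function names are just for illustration.

```python
# Counts reported above for the RESISC45 audit.
TOTAL_IMAGES = 30_000
issue_counts = {"label issues": 281, "outliers": 363, "near duplicates": 20}


def issue_rates(counts, total):
    """Return each issue count as a percentage of the dataset size."""
    return {name: 100.0 * n / total for name, n in counts.items()}


rates = issue_rates(issue_counts, TOTAL_IMAGES)
for name, pct in rates.items():
    print(f"{name}: {pct:.2f}% of images")
```

Each issue type affects roughly 1% of the images or less, which helps explain why these problems are so hard to spot by browsing the data manually.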
Mislabeled Examples
Below are a few examples of satellite images that Cleanlab Studio automatically detected to be mislabeled. Notice that in the third example the image does contain a runway, but the correct label in the context of the rest of the dataset is airport: the image also contains buildings, airplanes, service roads, etc., which make it an airport rather than just a runway, and only one label can be chosen per image. If this dataset were designed as multi-label (explained later), the image could be labeled as both airport and runway. This is a great example of a human error that is hard to catch yet is easily detected by Cleanlab Studio.
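For intuition about how a mislabeled image can be surfaced automatically, here is a minimal sketch of one simple heuristic: rank examples by the probability a trained classifier assigns to the label the annotator gave. This is not Cleanlab Studio's actual algorithm, only an illustration. The probabilities below are made up, and in practice they should come from a model that never saw the example during training (for instance via cross-validation).

```python
def rank_by_self_confidence(given_labels, pred_probs):
    """Return example indices sorted from most to least suspicious.

    Self-confidence is the model's predicted probability for the label the
    annotator assigned; a low value means the model disagrees with the label.
    """
    scores = [probs[label] for label, probs in zip(given_labels, pred_probs)]
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical predicted probabilities for the classes [airport, runway].
CLASSES = ["airport", "runway"]
pred_probs = [
    [0.90, 0.10],  # labeled airport, model agrees
    [0.85, 0.15],  # labeled runway, model strongly prefers airport
    [0.20, 0.80],  # labeled runway, model agrees
]
given_labels = [0, 1, 1]  # indices into CLASSES

ranking = rank_by_self_confidence(given_labels, pred_probs)
most_suspicious = ranking[0]
suggested = CLASSES[
    max(range(len(CLASSES)), key=lambda c: pred_probs[most_suspicious][c])
]
print(f"Review example {most_suspicious} first; model suggests '{suggested}'")
```

The second image plays the role of the runway/airport example above: the given label gets low probability, so it rises to the top of the review queue along with a suggested correction.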
Outliers
Below are a few examples of satellite images that Cleanlab Studio automatically detected as out-of-distribution (outliers). Although these appear to be satellite images, none of them really belongs to any of the classes in the dataset.
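One common way to think about outliers is distance in an embedding space: an image whose nearest neighbors are all far away looks unlike the rest of the data. The sketch below assumes each image has already been mapped to a feature vector by some model (the 2-D points here are invented for illustration) and is not a description of how Cleanlab Studio scores outliers.

```python
import math


def knn_outlier_scores(embeddings, k=2):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, point in enumerate(embeddings):
        # Distances from this point to every other point, smallest first.
        dists = sorted(
            math.dist(point, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical 2-D embeddings: three similar images and one atypical image.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(embeddings, k=2)
most_atypical = max(range(len(scores)), key=lambda i: scores[i])
print(f"Most atypical image index: {most_atypical}")
```

A higher score means the image sits farther from everything else, so the highest-scoring images are the ones worth reviewing first.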
Duplicate Examples
Below are some examples of satellite images that Cleanlab Studio automatically detected as duplicates. Duplicate examples can be detrimental to model evaluation if the same data points exist in both the training and evaluation data.
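To see why duplicates matter for evaluation, here is a minimal sketch that checks whether byte-identical images appear in both a training and an evaluation split. The file names and byte strings are hypothetical stand-ins for real image files. Hashing only catches exact copies; near duplicates (resized, re-encoded, or slightly shifted images) need a similarity-based method instead.

```python
import hashlib


def content_hash(image_bytes):
    """Fingerprint an image by the SHA-256 digest of its raw bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaked_duplicates(train, test):
    """Return (train_name, test_name) pairs whose bytes are identical."""
    train_by_hash = {}
    for name, data in train.items():
        train_by_hash.setdefault(content_hash(data), []).append(name)

    leaks = []
    for name, data in test.items():
        for train_name in train_by_hash.get(content_hash(data), []):
            leaks.append((train_name, name))
    return leaks


# Hypothetical splits: file name -> raw image bytes.
train = {"airport_001.jpg": b"\x01\x02\x03", "runway_007.jpg": b"\x04\x05\x06"}
test = {"airport_150.jpg": b"\x01\x02\x03", "beach_010.jpg": b"\x07\x08\x09"}

leaks = find_leaked_duplicates(train, test)
for train_name, test_name in leaks:
    print(f"{test_name} duplicates training image {train_name}")
```

Any pair reported here means the model is being evaluated on an image it has already seen, which makes the evaluation score look better than it should.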
Multi-class or multi-label?
One arguable takeaway from these findings is that this dataset could have been designed as a multi-label dataset instead of a multi-class dataset.
Multi-class classification: A single image belongs to exactly one of the classes – the classes are mutually exclusive.
Multi-label classification: A single image can belong to one or more classes simultaneously or none of the classes at all – the classes are not mutually exclusive (each class either applies to the image or not).
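The difference between the two setups is easiest to see in how the targets are encoded. The sketch below uses a three-class subset of the dataset's categories purely for illustration; the helper names are our own.

```python
# A small illustrative subset of the dataset's 45 scene categories.
CLASSES = ["airport", "runway", "beach"]


def one_hot(label, classes=CLASSES):
    """Multi-class target: exactly one entry is 1."""
    if label not in classes:
        raise ValueError(f"unknown class: {label}")
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes=CLASSES):
    """Multi-label target: any number of entries, including zero, may be 1."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in classes]


multi_class_target = one_hot("airport")                # must pick one class
multi_label_target = multi_hot({"airport", "runway"})  # both classes apply
no_label_target = multi_hot(set())                     # no class applies

print(multi_class_target, multi_label_target, no_label_target)
```

In the multi-class encoding, the airport image from earlier has to give up its runway; in the multi-label encoding, it can keep both.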
Multi-label datasets provide more information per example, which lets ML engineers and data scientists train models that capture everything present in a scene rather than a single dominant category. In our case, Cleanlab Studio automatically detected that many images in this dataset have more than one applicable label, even though each image was originally annotated with only a single label.
Conclusion
Cleanlab Studio is a powerful and efficient tool for identifying and addressing many types of issues in satellite imagery data. The examples above show the variety of issues that can plague a dataset: mislabeled images, outliers, and near duplicates, each of which can have harmful downstream consequences. Cleanlab Studio allows you to automatically find and fix these issues long before they harm your modeling or analytics.
Next Steps
- Automatically find and fix issues in your own satellite imagery data with Cleanlab Studio!
- Join our Community Slack to discuss ways to practice data-centric computer vision.
- Follow us on LinkedIn or Twitter for updates on new techniques to ensure the highest quality data.