
Enhancing Product Analytics and E-commerce with Data-Centric AI

  • Sanjana Garg

A successful E-commerce business relies on having accurate product listings and analytics, both of which depend on a good product taxonomy (categorization). For businesses with large catalogs and sellers offering many products, mistakes naturally appear in product listings due to human/automation error. Such data errors negatively affect the E-commerce website experience (searchability) as well as machine learning and analytics efforts. Now you can easily address such problems using Cleanlab Studio – a no-code AI tool that can automatically identify miscategorized products and those outside the current taxonomy in massive catalogs.


To demonstrate the importance of ensuring high-quality retail data, this article considers a real E-commerce dataset containing images of apparel and footwear products alongside information for each product including: ProductType, SubCategory, Category, Gender, Colour, Usage and Product Title. Here’s how this product dataset is organized:

We quickly run this dataset through Cleanlab Studio, which automatically detects various types of errors and presents them in an intuitive interface for efficient review. Going through the potential data issues and analytics presented in Cleanlab Studio only takes 5 minutes, and reveals obvious ways to improve the E-commerce website, product listings, and analytics. This article shows findings obtained from this dataset, and how this knowledge can help improve retail business processes.

Better Reporting and Decision Making

incorrect flats

The above image shows some examples of miscategorized Women’s Footwear out of the many examples of miscategorized products automatically flagged by Cleanlab Studio. Overall, Cleanlab Studio identified that a whopping 25% of the Flats in this dataset were miscategorized into other categories and that over 20% of the products categorized as Flats were actually Heels.

Such miscategorization distorts reporting metrics like the percentage of sales attributed to Heels. These errors can cascade into larger problems, such as business decisions based on overestimated or underestimated product sales.
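To make the two rates quoted above concrete, here is a minimal Python sketch of how they could be computed once each product has both its original category and a corrected one. The records, function names, and the exact reading of each percentage are illustrative assumptions, not Cleanlab Studio output or the method used for this article.

```python
# Toy records: (given_category, corrected_category). All values are made up.
records = [
    ("Flats", "Flats"),
    ("Flats", "Heels"),
    ("Heels", "Flats"),
    ("Flats", "Flats"),
    ("Heels", "Heels"),
    ("Sandals", "Flats"),
    ("Flats", "Flats"),
    ("Heels", "Heels"),
]


def share_listed_elsewhere(records, category):
    """Fraction of products truly in `category` that are listed under another one."""
    given_for_true_members = [g for g, c in records if c == category]
    if not given_for_true_members:
        return 0.0
    wrong = sum(g != category for g in given_for_true_members)
    return wrong / len(given_for_true_members)


def share_actually(records, listed_as, actually):
    """Fraction of products listed as `listed_as` whose corrected category is `actually`."""
    corrected_for_listed = [c for g, c in records if g == listed_as]
    if not corrected_for_listed:
        return 0.0
    hits = sum(c == actually for c in corrected_for_listed)
    return hits / len(corrected_for_listed)


flats_listed_elsewhere = share_listed_elsewhere(records, "Flats")
listed_flats_that_are_heels = share_actually(records, "Flats", "Heels")
print(f"True Flats listed elsewhere: {flats_listed_elsewhere:.0%}")
print(f"Listed Flats that are Heels: {listed_flats_that_are_heels:.0%}")
```

The two functions answer different questions: the first looks at products that really are Flats, the second at products currently listed as Flats. Any sales metric grouped by the given category inherits both kinds of error.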

count change

Retailers frequently base key decisions on the number of products in each category. The figure above shows the percentage change in product counts per category after we correct the automatically detected errors in the dataset. In categories like Sports Sandals and Flip Flops, where a large portion of items were miscategorized, product counts change significantly after data correction. Thus, performing analytics on the original, miscategorized dataset could lead to glaringly wrong conclusions. This problem can be prevented by using Cleanlab Studio to clean the dataset before reporting any analytics.
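The per-category change described here is simple to compute once corrected labels are available. The sketch below is one possible way to do it in plain Python; the label lists are invented toy data and the function name is hypothetical.

```python
from collections import Counter


def pct_change_in_counts(given_labels, corrected_labels):
    """Percentage change in product count per category after label correction."""
    before = Counter(given_labels)
    after = Counter(corrected_labels)
    changes = {}
    for category in sorted(set(before) | set(after)):
        if before[category] == 0:
            # Category did not exist before correction, so the change is undefined.
            changes[category] = None
        else:
            diff = after[category] - before[category]
            changes[category] = 100.0 * diff / before[category]
    return changes


# Toy data (assumed): two Sports Sandals turn out to be Flip Flops.
given = ["Flip Flops"] * 2 + ["Sports Sandals"] * 4 + ["Flats"] * 2
corrected = ["Flip Flops"] * 4 + ["Sports Sandals"] * 2 + ["Flats"] * 2

changes = pct_change_in_counts(given, corrected)
for category, change in changes.items():
    print(category, change)
```

Small categories are the most sensitive: moving two items doubles the Flip Flops count in this toy example while leaving Flats unchanged.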

Increase Discoverability of Products

Within Cleanlab Studio, the Analytics tab summarizes the miscategorizations in the dataset. The above video shows the number of products labeled Casual Shoes for which Cleanlab Studio suggests Sports Shoes instead; clicking on the corresponding bar takes you directly to the specific samples. The analysis in the video reveals that Casual Shoes and Sports Shoes are the two categories that are frequently miscategorized as each other. This suggests these two categories of shoes might be tough to distinguish, both for sellers and buyers. Brands also often market some shoes as suitable for both casual wear and specific activities, which makes it even harder for sellers to categorize these products accurately and for buyers to find the product they want.
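For readers who export the given and suggested labels, a summary like the one in the Analytics tab can be approximated by counting how often each pair of categories is confused. This is a rough sketch on invented data; the function name is hypothetical and it is not how Cleanlab Studio computes its analytics.

```python
from collections import Counter


def confused_pairs(given_labels, suggested_labels):
    """Count disagreements per unordered category pair, most frequent first."""
    pairs = Counter()
    for given, suggested in zip(given_labels, suggested_labels):
        if given != suggested:
            # Sort so (A, B) and (B, A) are counted as the same pair.
            pairs[tuple(sorted((given, suggested)))] += 1
    return pairs.most_common()


# Toy data (assumed).
given = ["Casual Shoes", "Sports Shoes", "Casual Shoes",
         "Formal Shoes", "Casual Shoes", "Sports Shoes"]
suggested = ["Sports Shoes", "Casual Shoes", "Sports Shoes",
             "Casual Shoes", "Casual Shoes", "Sports Shoes"]

ranking = confused_pairs(given, suggested)
for pair, count in ranking:
    print(pair, count)
```

Treating the pair as unordered highlights categories that are mutually hard to tell apart, which is the signal that a taxonomy may need finer-grained or better-defined categories.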

casual vs sports

The above figure shows potentially miscategorized samples of Men’s Footwear found by following the workflow in the previous video. The first two shoes are categorized as Sports Shoes and the next two as Casual Shoes. However, one cannot tell by looking at them which ones are suitable for training, running, or casual wear. For these kinds of cases, it might be useful to have more fine-grained categories like training shoes, running shoes, lifestyle shoes, etc., rather than broad categories like Sports Shoes. Cleanlab Studio can recognize and flag these broad categories that make it painfully difficult for consumers to find what they are looking for. By improving these product categorizations (either manually or automatically with the help of Cleanlab Studio), we can easily improve the discoverability of these products and ultimately sales conversions.


The image above shows products that Cleanlab Studio flagged as miscategorized by color. Color categorization is hard, particularly for these products, which all come in various shades of blue, so placing each product into a single color category is challenging. This affects search results when shoppers look specifically for a blue t-shirt or shirt: products indexed under an overly fine-grained color category, like those in the lower row, may not be displayed at all. Cleanlab Studio can automatically surface these instances for inspection and correction. Addressing these miscategorizations improves the customer experience, since users no longer have to search through multiple categories to find their desired product.

Boost SEO Efforts

incorrect descriptions

While checking incorrectly categorized products, we also came across products with inaccurate descriptions. Product descriptions play a vital role in the automated cataloging of products, whether via machine learning or rule-based categorization systems, so inaccurate descriptions can introduce errors into the categorization process. Search engines also rely heavily on crawling E-commerce websites, including their images and associated text, to serve search results. Incorrect product descriptions can therefore hurt the website’s visibility and impede SEO (Search Engine Optimization) efforts. By automatically identifying incorrect product categorizations, Cleanlab Studio can bring attention to instances where website information needs to be rectified.

Improve Customer Experience

exclude example

At times, a product might not fit into any of the provided categories. The image above displays a couple of shoes that appear to be kids’ shoes, discovered within the Women’s Footwear section. These anomalies can be spotted by looking at the products that were automatically flagged as Outliers by Cleanlab Studio. In such cases, one can reach out to the product’s seller for confirmation that the product is listed and categorized as intended, or flag it as anomalous for recommendation purposes. Encountering these anomalies can undermine customers’ trust and confidence in the website. Indeed, a majority of customers drop out on their first visit to the website because they are not confident about their purchase.

More Effective Advertising Campaigns

formal shoes

The above image shows Formal Shoes that Cleanlab Studio identified as miscategorized as Casual Shoes. Suppose a customer has shown interest in Casual Shoes based on their previous purchases and user persona. Because of the incorrect categorization, the recommender may show them these miscategorized Formal Shoes instead of the products they actually want, which reduces the effectiveness of targeted marketing. Cleanlab Studio can increase the effectiveness of marketing campaigns by automatically identifying and correcting such issues in the product catalog.


How are such insights automatically revealed using Cleanlab Studio?

It took only a few minutes to upload the dataset and get started with our analysis. The above video shows how to upload a dataset with various kinds of metadata like ProductType, ProductTitle, etc. in just a few clicks.

select label

We can easily choose each of the product metadata columns of interest as a label column from a dropdown menu. Cleanlab Studio will automatically detect potentially erroneous values in this column.

cleanlab project page

Once Cleanlab Studio finishes its automated analysis of the dataset using various AI algorithms (indicated by the project status changing to Ready for Review), users can access a straightforward interface to visualize all the identified issues in the dataset. These issues can include label errors, outliers, and more, providing valuable insights into the data quality. The above image demonstrates the intuitive nature of the interface, allowing users to easily explore and understand the detected issues.


Upon discovering multiple errors in the dataset, we used the resolver window to correct miscategorized products. The image displayed above shows the resolver window, which suggests that the shoe belongs to the Formal Shoes category with 99% confidence. The resolver window makes it convenient to review all miscategorized products, while the confidence levels indicate how easy or difficult each product is to categorize: a high-confidence suggestion typically points to a clear-cut error, whereas a low-confidence one points to a product that is ambiguous between categories.
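One way to act on such confidence values outside the interface is to accept high-confidence suggestions automatically and route the rest to a person. The sketch below illustrates the idea only; the field names, the 0.95 threshold, and the records are all assumptions rather than Cleanlab Studio's export format or recommended settings.

```python
def triage(suggestions, threshold=0.95):
    """Split suggestions into auto-accepted corrections and items for manual review."""
    accepted, needs_review = [], []
    for item in suggestions:
        if item["confidence"] >= threshold:
            # Copy the record with the suggested category applied as its label.
            accepted.append({**item, "label": item["suggested"]})
        else:
            needs_review.append(item)
    return accepted, needs_review


# Hypothetical records; only the 99% Formal Shoes case mirrors the article.
suggestions = [
    {"id": 1, "label": "Casual Shoes", "suggested": "Formal Shoes", "confidence": 0.99},
    {"id": 2, "label": "Casual Shoes", "suggested": "Sports Shoes", "confidence": 0.62},
]

accepted, needs_review = triage(suggestions)
print("auto-corrected:", [item["id"] for item in accepted])
print("manual review:", [item["id"] for item in needs_review])
```

A stricter threshold sends more products to manual review, which may be preferable for categories that are known to be ambiguous, such as Casual Shoes versus Sports Shoes.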

label corrections

This article demonstrated Cleanlab Studio as a no-code solution to detect categorization errors in huge product catalogs with just a few steps. While we mostly focused on product categories and other public metadata, retailers can use the same tool to automatically detect errors in internal product metadata like tax categories, age-restriction requirements, pricing/shipping tiers, etc. Correcting such information errors will improve your E-commerce website and data-driven decision-making upon which your business relies.

In just a few minutes, try running your dataset through Cleanlab Studio! It works for multimodal data sources including: product descriptions (text), images (demonstrated in this article), and structured (tabular numeric/categorical) information about each product (weight, price, rating, brand, …).

