Getting Started with Cleanlab Studio

Welcome to Cleanlab Studio! This starter guide will show you what an end-to-end use of Cleanlab Studio looks like. At a high level, the workflow is:

Upload a dataset
Create a project
Review the errors
Export a cleanset (the cleaned dataset)

Cleanlab Studio is the best tool for turning your unreliable data into reliable models and insights! Currently, Cleanlab Studio supports multi-class classification (where each example is labeled as exactly 1 of K classes) and multi-label classification (also known as tagging, where each example is labeled as one or more of K classes). Support for other ML tasks like regression, object detection, and image segmentation is coming soon!

Demo Datasets

Cleanlab Studio comes pre-loaded with several demo datasets and projects, so you can check those out in your account after signing in. Alternatively, if you’d like to go through the entire workflow using a demo dataset, here are some you can use to get started:

Tabular/CSV: grades-tabular-demo.csv
Text: amazon-text-demo.csv
Image: mnist.tar.gz (please extract this archive before uploading it to Cleanlab Studio)
This tutorial’s dataset: tweets.csv

Upload a dataset

Cleanlab Studio offers a variety of ways to upload your data to best suit your needs. You can upload data from your computer, via URL, via command line, or via our Python API. We also offer Data Warehouse and Cloud Storage options for Enterprise Users!

How to Format Your Dataset

Cleanlab Studio supports text, image, and tabular datasets (we’re constantly adding new modalities) in multiple formats. The free tier supports CSV, JSON, Excel, DataFrames, and ZIP files. If you’re uploading your own dataset, you can use the “How to format your dataset” wizard on the upload page to get a walkthrough of the best way to upload your specific type of data.
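Before uploading a CSV, a quick local check can catch formatting problems early. The sketch below uses only Python's standard library and is not part of Cleanlab Studio. The function name `check_csv`, the `label` column name, and the sample rows are all illustrative assumptions.

```python
import csv
import io

def check_csv(text, label_column):
    """Return a list of human-readable problems found in CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    if label_column not in columns:
        return [f"missing label column: {label_column!r}"]
    problems = []
    if len(set(columns)) != len(columns):
        problems.append("duplicate column names")
    # Rows are counted from 1, not including the header row.
    for row_no, row in enumerate(reader, start=1):
        if not (row[label_column] or "").strip():
            problems.append(f"row {row_no}: empty label")
    return problems

# Hypothetical sample data, for illustration only.
sample = "text,label\ngreat product,positive\nnever arrived,\n"
print(check_csv(sample, "label"))  # ['row 2: empty label']
```

An empty list means none of these particular problems were found; it does not guarantee that the upload will succeed.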

Once you upload a dataset…

Cleanlab Studio will automatically infer its schema (the data types and feature types of all the fields). You can review this schema and make any corrections before clicking “Confirm schema.”

Cleanlab Studio will then immediately analyze your dataset and display any processing issues or missing data cells.
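To make the terms "schema" and "missing data cells" concrete, here is a toy sketch that guesses a coarse type for each column and counts its empty cells. It is a deliberately simplified illustration and not Cleanlab Studio's inference logic; the function names and sample data are hypothetical.

```python
import csv
import io

def infer_type(values):
    """Guess a coarse data type for one column from its non-empty values."""
    present = [v for v in values if v.strip()]
    if not present:
        return "unknown"
    # Try the strictest type first, then fall back to looser ones.
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def summarize(text):
    """Map each column to (guessed type, number of empty cells)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for column in (rows[0].keys() if rows else []):
        values = [(row[column] or "") for row in rows]
        missing = sum(1 for v in values if not v.strip())
        summary[column] = (infer_type(values), missing)
    return summary

# Hypothetical sample data, for illustration only.
sample = "student,score,grade\nana,91.5,A\nben,,B\ncy,78,C\n"
print(summarize(sample))
```

In this sample, `score` is guessed to be a float column with one missing cell, which is the kind of result you would want to check when reviewing the inferred schema.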

Create a Project to Find and Correct Outliers and Label Issues

To analyze a dataset for errors, you first need to create a project. There are a few options to configure when creating a project:

Machine Learning Task: What type of task are you training a model to accomplish? Currently, we support Classification for text, tabular and image datasets.
Type of Classification: Classification tasks are either multi-class (each datapoint is assigned to exactly 1 of K classes) or multi-label (each datapoint can belong to anywhere from 0 to K of the K classes).
Label Column: The column in your dataset that you want us to find label errors in and suggest label corrections on.
Predictive Columns / Text Column: Depending on your machine learning task, you’ll be asked to specify which columns we should use to train our classification models on.
Model Type: Fast mode trains and suggests corrections more quickly but may produce lower-quality results, while regular mode gives the best results but can take up to 24 hours for large datasets.

We’ll send you an email when analysis is complete.
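The difference between the two classification types described above comes down to the shape of each label. The sketch below checks a list of labels against a task type. The function `check_labels` and the sample classes are hypothetical, and it does not describe the label format that Cleanlab Studio expects on upload.

```python
def check_labels(labels, classes, task_type):
    """Return indices of examples whose label does not fit the task type.

    multi-class: each label is a single class name.
    multi-label: each label is a collection of zero or more class names.
    """
    allowed = set(classes)
    bad = []
    for i, label in enumerate(labels):
        if task_type == "multi-class":
            ok = isinstance(label, str) and label in allowed
        elif task_type == "multi-label":
            ok = not isinstance(label, str) and set(label) <= allowed
        else:
            raise ValueError(f"unknown task type: {task_type!r}")
        if not ok:
            bad.append(i)
    return bad

classes = ["cat", "dog"]
# "bird" is not one of the known classes.
print(check_labels(["cat", "dog", "bird"], classes, "multi-class"))        # [2]
# An empty list is a valid multi-label annotation; "fish" is unknown.
print(check_labels([["cat"], [], ["cat", "fish"]], classes, "multi-label"))  # [2]
```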

Review the errors

Cleanlab Studio automatically flags examples that have label issues or data errors. For each identified label issue, Cleanlab Studio suggests a label that may be more appropriate for that example. You can accept all of the suggestions with the “Auto-fix top issues” button, but for best results, we recommend reviewing the flagged issues. Cleanlab Studio’s label error correction interface makes this easy: data is ranked by quality so that your time is spent on the data that needs review (no need to review already-clean data). Additionally, you can choose to exclude data points, which is our recommended action for outliers (data points that do not appear to belong to any of the classes in your dataset).

After reviewing flagged issues and taking actions to fix the data, you have produced a cleaned version of your dataset, which we call a cleanset.
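The review step can be pictured as two operations: ordering flagged examples so the lowest-quality ones come first, then applying an action to each. The sketch below is a simplified illustration of that idea. The field names (`quality`, `suggested`, `outlier`), the 0.5 threshold, and the sample records are all hypothetical and do not reflect how Cleanlab Studio scores data.

```python
# Hypothetical records, for illustration only.
examples = [
    {"id": 1, "label": "positive", "suggested": None, "quality": 0.97, "outlier": False},
    {"id": 2, "label": "negative", "suggested": "positive", "quality": 0.12, "outlier": False},
    {"id": 3, "label": "positive", "suggested": None, "quality": 0.05, "outlier": True},
    {"id": 4, "label": "negative", "suggested": None, "quality": 0.40, "outlier": False},
]

def review_queue(rows, threshold=0.5):
    """Flagged examples only, lowest quality first."""
    flagged = [r for r in rows if r["quality"] < threshold]
    return sorted(flagged, key=lambda r: r["quality"])

def build_cleanset(rows, threshold=0.5):
    """Apply one action per example and return (id, final label) pairs."""
    cleanset = []
    for r in rows:
        flagged = r["quality"] < threshold
        if flagged and r["outlier"]:
            continue  # exclude outliers from the cleanset
        if flagged and r["suggested"]:
            cleanset.append((r["id"], r["suggested"]))  # accept the suggestion
        else:
            cleanset.append((r["id"], r["label"]))  # keep the given label
    return cleanset

print([r["id"] for r in review_queue(examples)])  # [3, 2, 4]
print(build_cleanset(examples))
```

Here every suggestion is accepted automatically, which mirrors a bulk fix. Reviewing the queue by hand lets you override any of these actions.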

Project Analytics

Along with the review interface, Cleanlab Studio offers analytics information about the label errors and data issues in your dataset. From the Analytics tab, you can view information about the classes with the most label issues, the corrections that we most commonly suggest, and more! This tab can give you a high-level summary of your data, or offer a direct view into the specific issues that we’ve detected: click on any bar or square in an Analytics chart and we’ll show you the exact examples that it represents.
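The two summaries mentioned above (classes with the most label issues, and the most commonly suggested corrections) are simple counts over the flagged examples. The sketch below computes them with Python's `collections.Counter` on hypothetical (given label, suggested label) pairs; it illustrates the idea and is not how the Analytics tab is implemented.

```python
from collections import Counter

# Hypothetical (given label, suggested label) pairs for flagged examples.
flagged = [
    ("neutral", "positive"),
    ("neutral", "negative"),
    ("positive", "neutral"),
    ("neutral", "positive"),
]

# How many flagged examples carry each given label.
issues_per_class = Counter(given for given, _ in flagged)
# How often each specific correction is suggested.
common_corrections = Counter(flagged)

print(issues_per_class.most_common(1))    # [('neutral', 3)]
print(common_corrections.most_common(1))  # [(('neutral', 'positive'), 2)]
```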

Export a cleanset

Cleanlab Studio supports exporting a cleanset from the web app or the cleanlab-studio CLI / Python client library. The cleaned labels are available in the cleanlab_corrected_label column. The export also includes other metadata from Cleanlab Studio; look through the column headers to see what’s there. Data columns generated by Cleanlab Studio are prefixed with cleanlab_.
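Once you have an exported file, you can work with it like any other CSV. The sketch below lists the columns prefixed with `cleanlab_` and picks a final label for each row. Only `cleanlab_corrected_label` and the `cleanlab_` prefix come from this guide. The `label` and `cleanlab_hypothetical_score` columns, the sample rows, and the rule that an empty corrected label means "keep the original" are assumptions made for illustration, so check them against your own export.

```python
import csv
import io

# Hypothetical export excerpt, for illustration only.
export_text = (
    "text,label,cleanlab_corrected_label,cleanlab_hypothetical_score\n"
    "love it,negative,positive,0.08\n"
    "works fine,positive,,0.95\n"
)

def cleanlab_columns(text):
    """Names of the columns generated by Cleanlab Studio."""
    header = next(csv.reader(io.StringIO(text)))
    return [c for c in header if c.startswith("cleanlab_")]

def final_labels(text, label_column):
    """Prefer the corrected label; otherwise keep the original label."""
    labels = []
    for row in csv.DictReader(io.StringIO(text)):
        corrected = (row.get("cleanlab_corrected_label") or "").strip()
        # Assumption: an empty corrected label means the original stands.
        labels.append(corrected or row[label_column])
    return labels

print(cleanlab_columns(export_text))
print(final_labels(export_text, "label"))  # ['positive', 'positive']
```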

You can also re-run Cleanlab on the cleanset. This will re-analyze the new version of your dataset with ML models trained on the cleaner data, which often gives even better results.

Need help using Cleanlab Studio?

Because you are one of our early users, we’re inviting you to get direct support from our engineering team over Slack. Once you join the Cleanlab Community Slack, we’ll add you to a support channel for you and/or your company where you can ask our engineers questions directly. You can also contact support via email at support@cleanlab.ai.