
Finding Label Issues in Audio Classification Datasets

April 27, 2022
  • Johnson Kuan
  • Jonas Mueller
  • Anish Athalye

In June 2021, Forbes published an article on the movement towards Data-Centric AI, which revolves around the insight that improving the data, rather than the model, can be a more effective way to improve the overall performance of AI systems.

Intuitively this makes sense because the quality of your Machine Learning (ML) models depends on the quality of the data used to train/evaluate them. Garbage in, garbage out. Moreover, given the abundance of awesome open-source ML modeling packages out there, the model aspect is more-or-less a solved problem for many business applications. A key challenge is how to make Data-Centric AI an efficient and systematic process. Therein lies the need for new tools focused on data quality for AI.

One such tool is cleanlab, the leading open-source package to automatically find label issues in any dataset. Cleanlab is powered by an algorithm called “Confident Learning”, whose output is provably consistent with the underlying label errors as researchers from MIT/Google proved in a theoretical analysis. This tool has been used to uncover thousands of label errors in the top 10 ML benchmark datasets (e.g. ImageNet, AudioSet). If even these extremely well-studied datasets contain bad labels, the data from your applications likely do too!

This blog demonstrates how to use cleanlab to find label issues in audio datasets used for supervised learning. As an example, we’ll use the Spoken Digit dataset (it’s like MNIST for audio), which contains 2,500 audio clips with English pronunciations of the digits 0 to 9. These are the labels we’d like to train a classifier to predict from the raw audio signals. Here are some examples of label issues found by cleanlab where the given label is 6:

6_yweweler_14.wav
6_yweweler_35.wav
6_nicolas_8.wav

This post will show you step-by-step how to run cleanlab to find these issues and more in the Spoken Digit dataset. You can use the same cleanlab workflow demonstrated here to easily find bad labels in your own dataset. To run this workflow yourself in under 5 minutes, check out:

Overview of the steps to find label issues with cleanlab

Before diving into the code, let’s see what our label error detection workflow looks like with cleanlab.

  1. We start with a labeled dataset of raw audio clips (.wav files), where some of the given annotations may be incorrect. For example, the below audio clip of the spoken digit 8 is erroneously labeled 6.

  2. Next, we select a classification model for the data. In this case, our model is a linear output layer trained on extracted features (aka embeddings) from audio clips (.wav files) obtained via a pre-trained PyTorch model that was previously fit to the VoxCeleb speech dataset.

  3. We then use cross-validation to train our model and compute out-of-sample predicted probabilities for every example in our dataset.

  4. Finally, we run one line of cleanlab code on these predicted probabilities to identify which audio clips may be badly labeled.

For the Spoken Digit example above, cleanlab is able to automatically detect that this audio clip has an incorrect label of 6. The rest of this blog dives into the code implementing this workflow.

Show me the code

We'll start by installing and importing some required packages.

You can use pip to install the dependencies for this workflow.

pip install cleanlab speechbrain tensorflow_io tensorflow pandas scikit-learn

Our first Python commands will be to import some of the required packages, set some configurations for better-looking output, and set seeds for reproducibility.

import numpy as np
import pandas as pd  # needed for the display option set below
import random
import os
import torch
import tensorflow as tf

SEED = 456

def set_seed(seed=0):
    """Ensure reproducibility."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.cuda.manual_seed_all(seed)

set_seed(SEED)
pd.options.display.max_colwidth = 500
tf.get_logger().setLevel('ERROR')  # suppress TF warnings

Prepare the dataset

Next we’ll download the Spoken Digit dataset into a folder: spoken_digits/
!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz
!mkdir spoken_digits
!tar -xf v1.0.9.tar.gz -C spoken_digits

If you don’t have wget installed, you can directly download the data by navigating to the link above.

Let’s now collect all the audio (.wav) file names in a single list: file_paths
DATA_PATH = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/"

# Get list of .wav file names

# os.listdir order is nondeterministic, so for reproducibility,
# we sort first and then do a deterministic shuffle
file_names = sorted(i for i in os.listdir(DATA_PATH) if i.endswith(".wav"))
random.Random(SEED).shuffle(file_names)

file_paths = [os.path.join(DATA_PATH, name) for name in file_names]

# Check out first 3 files
file_paths[:3]

Here’s what our first 3 data files look like:

['spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav']
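The sort-then-seeded-shuffle pattern used above can be checked in isolation. The following is a minimal sketch with made-up directory listings (the helper name reproducible_order is hypothetical, not part of the tutorial); it illustrates that the final order depends only on the seed, not on the order in which the file system returned the names.

```python
import random

def reproducible_order(names, seed):
    """Return the names in a shuffled order that depends only on the seed."""
    ordered = sorted(names)               # remove any dependence on input order
    random.Random(seed).shuffle(ordered)  # private RNG; global random state is untouched
    return ordered

# Two hypothetical listings of the same folder, returned in different orders
listing_a = ["7_george_26.wav", "0_nicolas_24.wav", "0_nicolas_6.wav", "6_nicolas_8.wav"]
listing_b = list(reversed(listing_a))

order_a = reproducible_order(listing_a, seed=456)
order_b = reproducible_order(listing_b, seed=456)
print(order_a == order_b)  # True
```

Using a dedicated random.Random(seed) instance, as the tutorial code does, keeps the shuffle independent of any other code that draws from the global random module.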

Note that the label (digits from 0 to 9) is indicated in the prefix of the file name (e.g. audio clip “6_nicolas_32.wav” has the label 6). Let’s explore the data and listen to some of the audio clips.
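As a small illustration of this naming convention, a label can be parsed from any file path as sketched below. The helper name label_from_path is hypothetical; the tutorial's own code performs the same parsing inline.

```python
from pathlib import Path

def label_from_path(wav_path):
    """Parse the integer digit label from a file name like '6_nicolas_32.wav'."""
    file_name = Path(wav_path).name       # drop the directory part
    return int(file_name.split("_")[0])   # the prefix before the first underscore

example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_32.wav"
print(label_from_path(example))  # 6
```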

We define a function that allows us to play .wav files within a Jupyter notebook: display_example()
import tensorflow_io as tfio
from pathlib import Path
from IPython import display

@tf.function
def load_wav_16k_mono(filename):
    """Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio (ensure sample rate is correct)."""
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav


def display_example(wav_file_name, audio_rate=16000):
    """Allows us to listen to any wav file and displays its given label
    in the dataset."""
    wav_file_example = load_wav_16k_mono(wav_file_name)
    label = Path(wav_file_name).parts[-1].split("_")[0]
    print(f"Given label for this example: {label}")
    display.display(display.Audio(wav_file_example, rate=audio_rate))

We can now check out a couple of random examples in the dataset.

Here’s an audio clip whose given label is 7:

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav"
display_example(wav_file_name_example)

Here’s a different audio clip whose given label is 0:

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav"
display_example(wav_file_name_example)

Feel free to change the wav_file_name_example variable above to listen to other examples from the dataset.

Define an audio classification model

Our supervised learning task is to classify which digit is being uttered in a given audio snippet. For this, we’ll use a model that has two components: a pretrained network backbone that embeds the audio signal into a vector representation, and a linear model that outputs predictions based on these representations.

Use pre-trained SpeechBrain model to featurize audio

SpeechBrain is an awesome package offering many PyTorch neural networks that have been pretrained on speech data. Here we instantiate an audio feature extractor using SpeechBrain’s EncoderClassifier(), which can be used to embed each audio clip into a vector representation. We specifically use the spkrec-xvect-voxceleb network, which has been pre-trained on the VoxCeleb speech dataset.

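# In SpeechBrain 1.0 and later, this class is provided by speechbrain.inference instead
from speechbrain.pretrained import EncoderClassifier

feature_extractor = EncoderClassifier.from_hparams(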
    "speechbrain/spkrec-xvect-voxceleb",
    # run_opts={"device":"cuda"}  # Uncomment this to run on GPU if you have one (optional)
)

Note that the pre-trained feature extractor was trained on a different dataset from the one we are searching for label issues in. This is important because cleanlab requires out-of-sample predicted probabilities, as will be explained subsequently.

We run all of our audio clips through the network to extract vector features that we store in: embeddings_array
import pandas as pd

# Create dataframe with .wav file names
df = pd.DataFrame(file_paths, columns=["wav_audio_file_path"])
df["label"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split("_")[0]))

# Feature extractor
import torchaudio

def extract_audio_embeddings(model, wav_audio_file_path: str) -> torch.Tensor:
    """Feature extractor that embeds audio into a vector."""
    # Load the audio file as a tensor (fs is the sample rate, unused here)
    signal, fs = torchaudio.load(wav_audio_file_path)
    # Pass the tensor through the pretrained neural net and extract its representation
    embeddings = model.encode_batch(signal)
    return embeddings

# Extract audio embeddings
embeddings_list = []
for i, file_name in enumerate(df.wav_audio_file_path): # for each .wav file name
    embeddings = extract_audio_embeddings(feature_extractor, file_name)
    embeddings_list.append(embeddings.cpu().numpy())

embeddings_array = np.squeeze(np.array(embeddings_list))
labels = df.label.values

Now we have a traditional ML dataset with features stored in an array embeddings_array and labels stored in another array labels. Each row in the first array corresponds to an audio clip. We’re now able to represent an audio clip as a 512-dimensional feature vector!

print(embeddings_array.shape)
(2500, 512)
print(labels[:50])
array([7, 0, 0, 8, 5, 0, 7, 1, 4, 4, 0, 3, 0, 5, 9, 5, 2, 3, 3, 0, 7, 5,
       6, 0, 8, 2, 4, 8, 5, 7, 0, 9, 2, 9, 9, 3, 1, 0, 7, 9, 1, 1, 8, 3,
       5, 9, 3, 9, 5, 0])

Use linear model to produce predictions

When leveraging pre-trained networks for classification tasks, it is common to add a linear output layer and fine-tune the network weights on new data. However, this can be computationally intensive and typically requires GPUs. Alternatively, we can freeze the weights of the pre-trained model and only tune the weights of the linear output layer, which is much more efficient. This is the strategy we use here.

For simplicity, we can use the linear model provided in sklearn rather than modifying the low-level PyTorch code to append a linear output layer to our network.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression(C=0.01, max_iter=1000, tol=1e-1, random_state=SEED)

scikit-learn models come with tons of awesome functionality: a single line of code trains the model (model.fit()) or uses it for inference (model.predict()). To train a good audio classification model, all we’d have to do is execute: model.fit(X=embeddings_array, y=labels). Here we’ll use a slightly different approach.

Compute out-of-sample predicted probabilities

Generally speaking, cleanlab uses predictions from a trained classifier to identify label issues in the data. More specifically, these predictions should be the classifier’s estimate of the conditional probability of each class for a specific example (cf. the predict_proba() method of scikit-learn classifiers). To learn more details about the algorithm cleanlab uses to find label issues, check out this blogpost on Confident Learning.

Typically we’ll want cleanlab to find label issues in all of the data we have. However, when we train a classifier on some of this data, its predictions for that data become untrustworthy due to overfitting. To resolve this issue, we will train our classifier using K-fold cross-validation, which enables us to get out-of-sample predicted probabilities for each example in the dataset. These are predictions from a copy of the classifier that was trained on a dataset that did not contain this example, and are thus less likely to be overfit. The cross_val_predict method used below enables you to easily generate out-of-sample predicted probabilities from any scikit-learn compatible model.

from sklearn.model_selection import cross_val_predict

cv_pred_probs = cross_val_predict(
    estimator=model, X=embeddings_array, y=labels, cv=5, method="predict_proba"
)
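To make the idea of out-of-sample predictions concrete, here is a dependency-free sketch of a K-fold loop. It is not how scikit-learn implements cross_val_predict: the fold assignment (index modulo k) is a simplification, since scikit-learn stratifies folds by class for classifiers, and the "model" is a deliberately trivial stand-in that predicts the class frequencies of its training folds. The function name and toy labels are assumptions made for illustration. What it does show is the key property: every example receives a prediction from a model that never saw it.

```python
from collections import Counter

def out_of_sample_class_priors(labels, num_classes, k=5):
    """Toy K-fold loop: each example is scored by a 'model' fit without it.

    The 'model' is just the class frequencies of the training folds,
    a stand-in for a real classifier such as logistic regression.
    """
    n = len(labels)
    pred_probs = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        # "Fit" using the training folds only
        counts = Counter(labels[i] for i in train)
        probs = [counts[c] / len(train) for c in range(num_classes)]
        # "Predict" for the held-out fold only
        for i in held_out:
            pred_probs[i] = list(probs)
    return pred_probs

toy_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
toy_pred_probs = out_of_sample_class_priors(toy_labels, num_classes=3, k=5)
print(toy_pred_probs[0])  # [0.375, 0.375, 0.25]
```

Each row of the result is a probability distribution over classes, which is the same shape of input that cleanlab expects as pred_probs.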

An additional benefit of cross-validation is that it facilitates more reliable evaluation of our model than a single training/validation split. Here’s how to estimate the accuracy of the model trained via cross-validation.

cv_accuracy = (cv_pred_probs.argmax(axis=1) == labels).mean()
print(f"Cross-validated estimate of accuracy on held-out data: {cv_accuracy}")
Cross-validated estimate of accuracy on held-out data: 0.9772

Models with higher accuracy tend to do a better job finding label errors when used with cleanlab. Thus we should always try to ensure that our model is reasonably performant.

Run Cleanlab to find label issues

Based on the given labels and out-of-sample predicted probabilities, cleanlab can identify label issues in our dataset in one line of code.

from cleanlab.filter import find_label_issues

# Generate an ordered list of indices corresponding to the audio clips with potential label errors
ordered_label_errors = find_label_issues(
    labels=labels,
    pred_probs=cv_pred_probs,
    return_indices_ranked_by="self_confidence",  # ranks the label issues
)
print(ordered_label_errors)

Output:

[ 516 1946  469 1871 1955 2132]

ordered_label_errors is an array of indices corresponding to examples whose labels are worth inspecting more closely. Above we requested that the indices of the identified issues be sorted by cleanlab’s self-confidence label quality score, which measures the quality of each given label via the probability assigned to it in our model’s prediction. Here are candidate examples that cleanlab tells us to inspect for label errors.

df.iloc[ordered_label_errors]
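The self-confidence score described above is simple enough to sketch by hand. The snippet below uses made-up labels and predicted probabilities and plain Python lists; it only illustrates the scoring and ranking step, and is not cleanlab's implementation (cleanlab additionally uses Confident Learning to decide which examples count as label issues before ranking them).

```python
def self_confidence_scores(labels, pred_probs):
    """Probability the model assigns to each example's given label."""
    return [probs[label] for label, probs in zip(labels, pred_probs)]

# Hypothetical toy data: 4 examples, 3 classes
toy_labels = [0, 1, 2, 1]
toy_pred_probs = [
    [0.9, 0.05, 0.05],  # given label 0, model agrees
    [0.1, 0.8, 0.1],    # given label 1, model agrees
    [0.7, 0.2, 0.1],    # given label 2, but the model favors class 0
    [0.3, 0.4, 0.3],    # given label 1, model is unsure
]

scores = self_confidence_scores(toy_labels, toy_pred_probs)
# Lowest score first: the most suspicious labels come first
ranked = sorted(range(len(scores)), key=lambda i: scores[i])
print(scores)  # [0.9, 0.8, 0.1, 0.4]
print(ranked)  # [2, 3, 1, 0]
```

A low score means the model assigns little probability to the given label, which is why such examples appear at the top of the list to inspect.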

Let’s look at some of the examples that cleanlab thinks may be mislabeled.

In the example below, the given label is 6 but it sounds like 8.

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav"
display_example(wav_file_name_example)

For each of the three examples below, the given label is 6, but they sound quite ambiguous.

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav"
display_example(wav_file_name_example)
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav"
display_example(wav_file_name_example)
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav"
display_example(wav_file_name_example)

Using examples like these to train/evaluate ML models may be a questionable idea!

Conclusion

Here we demonstrated how easy it is to use cleanlab to find label issues in an audio dataset. If there are label errors even in widely-studied and curated datasets like Spoken Digit, then label errors are likely lurking in your own audio data as well. Stop blindly trusting your data! You can integrate cleanlab into your ML development workflows to manage the quality of your data labels.

While cleanlab helps you automatically find data issues, an interface is needed to efficiently fix these issues in your dataset. Cleanlab Studio finds and fixes errors automatically in a (very cool) no-code platform. Export your corrected dataset in a single click to train better ML models on better data. Try Cleanlab Studio at https://app.cleanlab.ai/.

cleanlab is undergoing active development and we’re always interested in more open-source contributors!


Additional References

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

An Introduction to Confident Learning: Finding and Learning with Label Errors in Datasets

cleanlab documentation

Cleanlab Studio: no-code data improvement
