Improving any OpenAI Language Model by Systematically Improving its Data

This article demonstrates how data-centric AI tools like cleanlab can improve a fine-tuned Large Language Model (LLM; a.k.a. Foundation Model). These tools optimize the dataset itself rather than altering the model architecture or hyperparameters: running the exact same fine-tuning code on the improved dataset reduces test-set error by 37% on the politeness classification task studied here. We achieve similar accuracy gains via the same data-centric AI process across three state-of-the-art LLMs that one can fine-tune via the OpenAI API: Davinci, Ada, and Curie. These are variants of the base LLM underpinning GPT-3/ChatGPT.

Background

Labeled data powers AI/ML in the enterprise, but real-world datasets have been found to contain between 7% and 50% annotation errors. Imperfectly labeled text data hampers the training (and evaluation) of ML models across tasks like intent recognition, entity recognition, and sequence generation. Although pretrained LLMs are equipped with a lot of world knowledge, their performance is adversely affected by noisy training data (as noted by OpenAI). Here we illustrate data-centric techniques to mitigate the effect of label noise without changing any code related to model architecture, hyperparameters, or training. These data quality improvement techniques should thus remain applicable even for future advanced LLMs like GPT-10.

Why Fine-tuning?

LLMs acquire powerful generative and discriminative capabilities after being pre-trained on vast amounts of text from across the internet. Nonetheless, ensuring the LLM produces reliable outputs for a particular business use-case often requires additional training on actual data from this domain labeled with the desired outputs. This domain-specific training is known as fine-tuning the LLM and can be done via APIs offered by OpenAI. Imperfections in the data annotation process inevitably introduce label errors in this domain-specific training data, posing a challenge for proper fine-tuning and evaluation of the LLM.

Why Data-Centric AI?

Here are quotes from OpenAI on their strategy for training state-of-the-art AI systems:

“Since training data shapes the capabilities of any learned model, data filtering is a powerful tool for limiting undesirable model capabilities.”

“We prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.”

Clearly dataset quality is a vital consideration. Some organizations like OpenAI manually handle issues in their data to produce the very best models, but this is tons of work! Data-centric AI is an emerging science of algorithms to detect data issues, so you can systematically improve your dataset more easily with automation. Cleanlab develops some of the most popular open-source and enterprise tools for practicing data-centric AI to find and fix issues in real-world data (image, text, tabular, audio, etc).

Our LLM in these experiments is the Davinci model from OpenAI, which is their most capable GPT-3 model, upon which ChatGPT is based.

Overview

Here we consider a 3-class variant of the Stanford Politeness Dataset, which has text phrases labeled as: impolite, neutral, or polite. Because these labels were assigned by human raters, some of them are naturally low-quality.

This article walks through the following steps:

Use the original data to fine-tune different state-of-the-art LLMs via the OpenAI API: Davinci, Ada, and Curie.
Establish the baseline accuracy of each fine-tuned model on a test set with high-quality labels (established via consensus and high-agreement amongst many human annotators who rated each test example).
Use the find_label_issues() method from the cleanlab package to automatically identify hundreds of mislabeled examples.
Remove the data with automatically-flagged label issues from the dataset, and then fine-tune the exact same LLMs on the auto-filtered dataset. This simple step reduces the error in Davinci model predictions by 8%!
Introduce a no-code solution to efficiently fix the label errors in the dataset, and then fine-tune the exact same LLM on the fixed dataset. This reduces the error in Davinci model predictions by 37%!

Similar gains are achieved via these same processes for the Ada and Curie models; in all cases, neither the model nor the fine-tuning code was changed!

Here’s a notebook you can run to reproduce the results demonstrated in this article and understand the code to implement each step.

Politeness Dataset

You can download the train and test sets here: train test

Our training dataset has 1916 examples each labeled by a single human annotator, and thus some may be unreliable. The test dataset has 480 examples each labeled by five annotators, and we use their consensus label as a high-quality approximation of the true politeness (measuring test accuracy against these consensus labels). To ensure a fair comparison, this test dataset remains fixed throughout our experiments (all label cleaning / dataset modification is only done in the training set). We reformat these CSV files into the jsonl file type required by OpenAI’s fine-tuning API.
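As a rough sketch of that reformatting step, the snippet below converts rows with `prompt` and `completion` columns into JSONL, one JSON object per line. The function name `rows_to_jsonl` and the sample rows are hypothetical; the helper actually used in the notebook may differ, for example in how it adds separators or whitespace to prompts and completions.

```python
import csv
import io
import json


def rows_to_jsonl(rows):
    """Serialize dict rows as JSONL: one {"prompt", "completion"} object per line."""
    lines = []
    for row in rows:
        record = {"prompt": row["prompt"], "completion": str(row["completion"])}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical CSV content standing in for the real training file.
sample_csv = io.StringIO(
    "prompt,completion\n"
    '"Could you take a look when you have time?",polite\n'
    '"Fix this now.",impolite\n'
)
rows = list(csv.DictReader(sample_csv))
jsonl_text = rows_to_jsonl(rows)
print(jsonl_text)
```

Writing `jsonl_text` to a file yields one training example per line, which is the general shape a JSONL fine-tuning file takes.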

Fine-tune and Evaluate LLM

Here’s how our code looks to fine-tune the Davinci LLM for 3-class classification and evaluate its test accuracy:

python

!openai api fine_tunes.create -t "train_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "baseline"
>>> Created fine-tune: ft-9800F2gcVNzyMdTLKcMqAtJ5

Once the job completes, we query a fine_tunes.results endpoint to see the test accuracy achieved when fine-tuning this LLM on the original training dataset.

python

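!openai api fine_tunes.results -i ft-9800F2gcVNzyMdTLKcMqAtJ5 > baseline.csv

import pandas as pd

# The last row of the results file holds the final classification metrics.
df = pd.read_csv('baseline.csv')
baseline_acc = df.iloc[-1]['classification/accuracy']
print(f"Fine-tuning Accuracy: {baseline_acc}")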
>>> Fine-tuning Accuracy: 0.6312500238418579

Our baseline Davinci LLM achieves a test accuracy of 63% when fine-tuned on the raw training data with possibly noisy labels. Even a state-of-the-art LLM like the Davinci model produces lackluster results for this classification task. Is it because the data labels are noisy?

Automatically Find Label Issues

The cleanlab Python package employs Confident Learning algorithms to estimate which data are mislabeled in a classification dataset. These algorithms require out-of-sample predicted class probabilities for all of our training examples and apply a novel form of calibration to determine when to trust the model over the given label in the data.

To obtain these predicted probabilities we:

Use the OpenAI API to compute embeddings from the Davinci model for all of our training examples. You can download the embeddings here.
Fit a logistic regression model on the embeddings and labels in the original data. We use 10-fold cross-validation which allows us to produce out-of-sample predicted class probabilities for every example in the training dataset.

python

# Get embeddings from OpenAI.
import numpy as np
from openai.embeddings_utils import get_embedding

embedding_model = "text-similarity-davinci-001"
train["embedding"] = train.prompt.apply(lambda x: get_embedding(x, engine=embedding_model))
# Stack the per-example embedding lists into a 2D array, as scikit-learn expects.
embeddings = np.array(train["embedding"].tolist())

# Get out-of-sample predicted class probabilities via cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression()
labels = train["completion"].values
pred_probs = cross_val_predict(estimator=model, X=embeddings, y=labels,
                               cv=10, method="predict_proba")

With just one line of code, cleanlab estimates which examples have label issues in our training dataset.

python

from cleanlab.filter import find_label_issues

# Get indices of examples estimated to have label issues:
issue_idx = find_label_issues(labels, pred_probs,
                              return_indices_ranked_by='self_confidence')  # sort indices by likelihood of label error

Let’s take a look at a few of the label issues automatically identified in our dataset. Here’s one example that is clearly mislabeled:

Phrase: I’ll take a look at getLogEntries when I have time. Would you mind adding me as a committer?
Label: impolite

Labeling errors like this are why we might be seeing poor model results.

A few of the label issues found by cleanlab.

Note: find_label_issues is able to determine which of the given labels are potentially incorrect given only the out-of-sample pred_probs.
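To build intuition for how labels can be questioned using only predicted probabilities, here is a deliberately simplified sketch of the thresholding idea behind Confident Learning. It is not cleanlab's implementation, which involves further steps; the function names `class_thresholds` and `flag_label_issues` and the toy data are hypothetical. Each class gets a threshold equal to the average predicted probability of that class among examples carrying that label, and an example is flagged when the model confidently favors a class other than its given label.

```python
def class_thresholds(labels, pred_probs, n_classes):
    """Per-class threshold: mean predicted probability of class k among examples labeled k."""
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    return thresholds


def flag_label_issues(labels, pred_probs):
    """Flag examples confidently predicted to belong to a class other than their given label."""
    n_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears that class's own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                flagged.append(i)
    return flagged


# Toy data: 3 classes, 6 examples; example 3 is labeled 1 but looks like class 2.
toy_labels = [0, 0, 1, 1, 2, 2]
toy_pred_probs = [
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.05, 0.10, 0.85],
    [0.05, 0.05, 0.90],
    [0.20, 0.20, 0.60],
]
toy_issues = flag_label_issues(toy_labels, toy_pred_probs)
print(toy_issues)  # → [3]
```

Using per-class thresholds rather than a single global cutoff means a class the model is systematically less confident about is not penalized for that.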

Filter label issues and fine-tune a more robust LLM

Now that we have the indices of potentially mislabeled examples (identified via automated techniques), let’s remove these 471 examples from our training dataset. Fine-tuning the exact same Davinci LLM on the filtered dataset achieves a test accuracy of 66% (on the same test data where our original Davinci LLM achieved 63% accuracy). We reduced the error rate of the model by 8% using fewer but higher-quality training examples!

python

# Remove the label errors found by cleanlab.
train_cl = train.drop(issue_idx).reset_index(drop=True)
format_data(train_cl, "train_cl.jsonl")

# Train a more robust classifier with less erroneous data.
!openai api fine_tunes.create -t "train_cl_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "dropped"

# Evaluate model on test data.
!openai api fine_tunes.results -i ft-InhTRQGu11gIDlVJUt0LYbEx > cleanlab.csv
df = pd.read_csv('cleanlab.csv')
dropped_acc = df.iloc[-1]['classification/accuracy']
>>> 0.6604166626930237
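The 8% figure is a relative reduction in error rate (1 minus accuracy), not a difference in accuracy points. A small sketch of that arithmetic, using the two accuracies reported above; the helper name `relative_error_reduction` is hypothetical:

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error (1 - accuracy) eliminated by the new model."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Accuracies reported in this article for the baseline and filtered-data models.
baseline_acc = 0.6312500238418579
filtered_acc = 0.6604166626930237
reduction = relative_error_reduction(baseline_acc, filtered_acc)
print(f"{reduction:.1%}")  # prints 7.9%
```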

Fixing the Label Errors

Instead of just dropping the potential label issues, the smarter (yet more complex) way to improve our dataset would be to correct the label issues by hand. This simultaneously removes a noisy data point and adds an accurate one, but making such corrections manually is cumbersome.

Cleanlab Studio provides a user-friendly interface to make these changes without writing a single line of code. Simply upload your dataset and Studio computes everything we just did above via provided AI and data-quality algorithms, so you can spend more time fixing the issues instead of just finding them.

Studio automatically flags examples it thinks are likely mislabeled and provides suggested label corrections for the relevant data such that a dataset can be quickly improved. Here, we use the auto-fix feature on this dataset and replace the Studio-found label issues with the automatically-suggested label. From data upload to data export, the whole process took only 5 minutes.
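How Studio derives its suggested labels is not described here. As an assumed stand-in for illustration only, the sketch below replaces each flagged label with the class the model considers most likely, which keeps the dataset the same size instead of shrinking it. The function name `apply_suggested_labels` and the toy data are hypothetical.

```python
def apply_suggested_labels(labels, pred_probs, issue_idx):
    """Return a copy of labels where each flagged example gets the model's most likely class."""
    fixed = list(labels)
    for i in issue_idx:
        probs = pred_probs[i]
        fixed[i] = max(range(len(probs)), key=lambda k: probs[k])
    return fixed


labels = [0, 1, 2, 0]
pred_probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],  # given label 1, model favors class 2
    [0.1, 0.2, 0.7],
    [0.2, 0.7, 0.1],  # given label 0, model favors class 1
]
issue_idx = [1, 3]  # hypothetical indices flagged as label issues
fixed_labels = apply_suggested_labels(labels, pred_probs, issue_idx)
print(fixed_labels)  # → [0, 2, 2, 1]
```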

We then fine-tune the exact same Davinci LLM on the dataset corrected with Cleanlab Studio. The resulting model achieves 77% accuracy on the same test dataset as before, which is a 37% reduction in error from our original version of this model.

python

# Load in and format data improved with Studio.
train_studio = pd.read_csv('train_studio.csv')
format_data(train_studio, "train_studio.jsonl")

# Train a more robust classifier with less erroneous data.
!openai api fine_tunes.create -t "train_studio_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "studio"

# Evaluate model on test data.
!openai api fine_tunes.results -i ft-MQbaduYd8UGD2EWBmfpoQpkQ > studio.csv
df = pd.read_csv('studio.csv')
studio_acc = df.iloc[-1]['classification/accuracy']
>>> 0.7729166746139526

Note: throughout this entire process, we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! All improvement strictly comes from increasing the quality of our training data, which leaves room for additional optimizations on the modeling side.

Evaluating other LLMs

We repeated this same experiment with two other LLMs that OpenAI offers for fine-tuning: Ada and Curie. The resulting improvements look similar to those achieved for the Davinci model.

Conclusion

Data-centric AI is a powerful paradigm for handling noisy data via AI/automated techniques rather than the tedious manual effort data scientists often dread. Tools like Cleanlab help you efficiently find and fix data and label issues, and the resulting cleaner data can improve any ML model (not just LLMs) for most types of data (not just text, but also images, audio, tabular data, etc). Open-source versions of such tools can utilize any ML model to diagnose/fix issues in the data and then improve the data for any other ML model. No-code platforms come with good ML models built-in and interfaces to quickly correct data issues.

These sorts of tools will remain applicable with future advances in ML models like GPT-10, and will only become better at identifying issues when used with more accurate models! Practice data-centric AI to systematically engineer better data via AI/automation. This frees you to capitalize on your unique domain knowledge rather than fixing general data issues like label errors.

Next Steps

Easily improve your data with Cleanlab Studio!
Star our github repo to support open-source development for Data-Centric AI.
Join our Community Slack to discuss using AI/automation to improve your data.
Follow us on LinkedIn or Twitter to stay up-to-date on the best data quality tools.
