Use Cleanlab to Improve LLMs: Find Errors in Human Feedback in the Anthropic RLHF Dataset

April 11, 2023
  • Chris Mauck
  • Jonas Mueller

This blog uses Cleanlab Studio (an AI platform for detecting and fixing issues in data) to find mistakes in the human feedback (HF) provided during RLHF training of LLMs like Anthropic’s Claude. It is part of our CSA (Cleanlab Studio Audit) series, our way of informing the community about issues in popular datasets. To glean insights about a given dataset, we quickly run it through Cleanlab Studio.

Reinforcement Learning from Human Feedback Data

With Reinforcement Learning from Human Feedback (RLHF) becoming the main way to train AI assistants, it’s great to see organizations like Anthropic making their RLHF dataset publicly available (released as hh-rlhf in Hugging Face Datasets). We discovered various problems in this dataset just by quickly running it through Cleanlab Studio.

Like other RLHF datasets, every example in this one includes an input prompt and two outputs generated by the LLM: a chosen output and a rejected output, where a human rater preferred the former over the latter. But because humans make mistakes, Cleanlab Studio reveals that some of the rejected outputs in this dataset are unequivocally better than the chosen outputs. Below are a couple of the problematic examples detected in the dataset.
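Before turning to those examples, it may help to see how such a record can be pulled apart. The snippet below is a minimal sketch, not part of the Cleanlab Studio workflow. It assumes each record is a dict with `chosen` and `rejected` string fields, that each field holds a full transcript using `Human:` / `Assistant:` turn markers, and that the two transcripts differ only in the final assistant turn. The helper name `split_pair` and the sample record are hypothetical and written purely for illustration.

```python
# Real records could be loaded with the Hugging Face `datasets` package, e.g.
#   from datasets import load_dataset
#   data = load_dataset("Anthropic/hh-rlhf")
# (requires network access, so this sketch uses a hand-written record instead).

# Marker assumed to start each assistant turn in a transcript.
ASSISTANT_TAG = "\n\nAssistant:"


def split_pair(record):
    """Split a record into (prompt, chosen_reply, rejected_reply).

    Assumes both transcripts share everything up to the final assistant turn.
    """
    chosen, rejected = record["chosen"], record["rejected"]
    # Cut each transcript at its last assistant turn.
    c_prompt, _, c_reply = chosen.rpartition(ASSISTANT_TAG)
    r_prompt, _, r_reply = rejected.rpartition(ASSISTANT_TAG)
    if c_prompt != r_prompt:
        raise ValueError("transcripts do not share the same prompt")
    return c_prompt.strip(), c_reply.strip(), r_reply.strip()


# Hypothetical record, written for illustration (not copied from the dataset).
example = {
    "chosen": "\n\nHuman: How do I make a pinata?"
              "\n\nAssistant: A pinata is a decorated container filled with candy.",
    "rejected": "\n\nHuman: How do I make a pinata?"
                "\n\nAssistant: Cover a balloon in papier-mache, let it dry, then decorate it.",
}

prompt, chosen_reply, rejected_reply = split_pair(example)
print("PROMPT:  ", prompt)
print("CHOSEN:  ", chosen_reply)
print("REJECTED:", rejected_reply)
```

Viewing the prompt next to the two replies makes it easier to judge whether the human-chosen reply really is the better one.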

Example 1

One problematic example in the dataset

It’s clear here that the human-rejected output answers the question of how to make a pinata, whereas the human-chosen output merely describes what a pinata is (and is not actually a better output). The human who provided feedback simply made a mistake here!

Example 2

Again it’s clear that the human-chosen output for this prompt is not truly more desirable than the human-rejected output (unless this LLM was intended to function as a dietitian…)

Using Cleanlab Studio, we found many more such problematic examples where the human-chosen output merely describes the subject of the prompt without actually answering the query. Fixing such obvious data problems will allow much more reliable large language models to be produced via RLHF.
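One simple way to act on a confirmed labeling mistake is to swap the two outputs for that example. The snippet below is a hedged sketch of that idea only. It assumes the records are dicts with `chosen` and `rejected` fields and that the list of flagged indices comes from a human review of the detected issues. The function name `fix_flagged_pairs` and the sample data are hypothetical, and this is not a description of how Cleanlab Studio applies corrections.

```python
def fix_flagged_pairs(records, flagged_indices):
    """Return a new list in which flagged records have chosen/rejected swapped."""
    flagged = set(flagged_indices)
    fixed = []
    for i, rec in enumerate(records):
        rec = dict(rec)  # shallow copy so the input records stay unmodified
        if i in flagged:
            rec["chosen"], rec["rejected"] = rec["rejected"], rec["chosen"]
        fixed.append(rec)
    return fixed


# Hypothetical toy data: the second record was confirmed as mislabeled.
records = [
    {"chosen": "reply A", "rejected": "reply B"},
    {"chosen": "reply C", "rejected": "reply D"},
]
fixed = fix_flagged_pairs(records, flagged_indices=[1])
print(fixed[1])
```

Swapping is only appropriate when the rejected output is clearly better. When both outputs are poor, dropping the example may be the safer choice.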

To find and fix such issues in almost any dataset (text, image, tabular, etc.), just run it through Cleanlab Studio. Try this universal Data-Centric AI solution for free!