Overcoming Hallucinations with the Trustworthy Language Model

We’re excited to launch Cleanlab’s Trustworthy Language Model (TLM), which overcomes the biggest barrier to enterprise adoption of LLMs: hallucinations and reliability. By adding a trust score to every LLM response, TLM lets you contain and manage bogus LLM outputs, enabling you to deploy generative AI for new use cases previously unsuitable for LLMs. Through rigorous benchmarking, we’ve shown that TLM both produces more accurate outputs than existing LLMs and has better-calibrated trustworthiness scores (enabling greater cost/time savings) than other common approaches to taming LLM uncertainty.

Create an account to get free access to the TLM API, or experiment with TLM in the playground.

LLMs’ biggest challenge: hallucinations

A recent Gartner poll shows that while 55% of organizations are experimenting with generative AI, only 10% have put generative AI into production. One of the biggest barriers to productionizing LLMs is dealing with their tendency to produce bogus outputs known as hallucinations, which precludes their use in applications where correct outputs are necessary (i.e., most applications)!

Despite this, some organizations have deployed these unreliable LLMs, sometimes with catastrophic results. Air Canada’s chatbot hallucinated refund policies, eventually resulting in the airline being held responsible for the misinformation and being ordered by a tribunal to refund a customer; the chatbot has since been taken down. A federal judge fined a law firm after their lawyers used ChatGPT to draft a brief full of fabricated citations. New York City’s “MyCity” chatbot has been hallucinating wrong answers to business owners’ questions about local laws.

Overcoming hallucinations with trustworthiness scores

LLMs will always have some hallucinations, but by providing a trustworthiness score with every output, Cleanlab TLM lets you identify when the LLM is hallucinating. TLM is optimized for minimizing false negatives — when the LLM hallucinates, we want to make sure the trustworthiness score is low — to enable reliable deployment of LLM-based applications.

The TLM API can serve as:

A drop-in replacement for your LLM. Much like existing LLM APIs, TLM provides a .prompt() method, and TLM will return a response along with a trustworthiness score, enabling new use cases.
- Even the responses themselves are more accurate than the baseline model, because TLM internally produces many responses and returns the one with the highest trustworthiness score.
A layer of trust for your existing LLM outputs or human-generated data. TLM provides a .get_trustworthiness_score() method that can score any prompt/response pair.
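The response selection mentioned above (produce several candidates, return the most trustworthy one) can be pictured as a best-of-N choice. The sketch below only illustrates that idea and is not TLM’s internal implementation: `pick_most_trustworthy` is a hypothetical helper, the candidate scores are made up, and the dictionary keys simply mirror the `response` and `trustworthiness_score` fields used elsewhere in this post.

```python
from typing import Dict, List, Union

Candidate = Dict[str, Union[str, float]]


def pick_most_trustworthy(candidates: List[Candidate]) -> Candidate:
    """Return the candidate whose trustworthiness score is highest."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda c: c["trustworthiness_score"])


# Hypothetical candidates; the scores are invented for illustration only.
candidates = [
    {"response": "Paris", "trustworthiness_score": 0.97},
    {"response": "Lyon", "trustworthiness_score": 0.21},
    {"response": "Paris", "trustworthiness_score": 0.99},
]
best = pick_most_trustworthy(candidates)
print(best["response"], best["trustworthiness_score"])
```

Because the selection needs a score for every candidate, this kind of approach trades extra LLM calls for a better final answer.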

TLM works by augmenting existing LLMs with a layer of trust. The generally-available version of TLM lets you choose between a number of popular base models, including GPT-3.5, GPT-4, and GPT-4o, but TLM can augment any LLM with only black-box access to the LLM API. For enterprise use cases, such as adding trustworthiness to your custom fine-tuned LLM, contact us.

Berkeley Research Group (BRG) has already seen significant cost savings from leveraging TLM. According to Steven Gawthorpe, PhD, Associate Director and Senior Data Scientist at BRG:

While there are always other tools out there, Cleanlab’s TLM is the first viable answer to LLM hallucinations that I’ve seen. Several of our human-in-the-loop LLM workflows can now be 80% automated with Cleanlab’s trustworthiness scores on every LLM output. Doing this manually for the entire dataset is often impossible, but Cleanlab gives us the power of 1000s of data scientists to enrich data and strengthen LLM outputs. The downstream cost savings of using TLM for accurate data are substantial, providing significant financial benefits with 10x to 100x ROI for many of our clients. Other tools on the market aren’t even on the same playing field compared to what Cleanlab is doing.

Use cases enabled by TLM

Trustworthiness scores unlock new production use cases of LLMs, and any existing application of LLMs can also benefit by taking into account these scores.

Customer service chatbot

TLM can power trustworthy chatbots that answer the 80% of questions where they are confident and escalate to a human when they’re unsure about a response, rather than hallucinating one (as in the Air Canada case). This can be done simply by routing the question to a human when the trustworthiness score falls below a chosen threshold.
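As a minimal sketch of this routing rule, assume each answer is a dictionary with the `response` and `trustworthiness_score` keys used in this post. The function name `route_answer`, the escalation message, and the 0.8 threshold are all hypothetical; in practice you would tune the threshold on your own traffic.

```python
from typing import Dict, Union

# Hypothetical hand-off message shown when the bot declines to answer.
ESCALATION_MESSAGE = "Let me connect you with a human agent."


def route_answer(answer: Dict[str, Union[str, float]], threshold: float = 0.8) -> str:
    """Return the LLM response if its score clears the threshold, else escalate."""
    if answer["trustworthiness_score"] >= threshold:
        return str(answer["response"])
    return ESCALATION_MESSAGE


# Made-up answers for illustration.
confident = {"response": "Your order ships in 2 days.", "trustworthiness_score": 0.95}
unsure = {"response": "Refunds are always available.", "trustworthiness_score": 0.35}
print(route_answer(confident))
print(route_answer(unsure))
```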

Auto-labeling

LLMs are commonly used for auto-labeling data. With TLM, you can confidently auto-label a large fraction of your data and only have humans review a portion of the data where the LLM does not return trustworthy results.

from typing import Tuple

template = '''
What type of compliance issue is most likely present in the following document?
Please restrict your answer to a one word answer and nothing else.
Your answer should be selected from the following options: HIPAA, FERPA, GDPR, none.

Document below here:

{document}
'''

# `tlm` is assumed to be an already-initialized TLM client.
def classify(document: str) -> Tuple[str, float]:
  # Ask TLM to label the document; the result holds both the label and its trust score.
  answer = tlm.prompt(template.format(document=document))
  return answer['response'], answer['trustworthiness_score']

If you were to use this prompt to classify a large number of legal documents, you’d find that the documents with high trustworthiness scores were labeled correctly, while the documents with low scores had labels that needed to be double-checked:

document	response	trustworthiness
All medical health records will be accessed one way only. The patient’s medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.	HIPAA	0.984
⋮	⋮	⋮
TechTarget’s Cookies Policy includes the following terminology: “By continuing to use the site, you agree to the use of cookies.”	FERPA	0.426

For more on this use case, see the TLM auto-labeling tutorial.
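One way to act on these scores is to split the labeled data into an auto-accepted set and a human-review queue. The sketch below assumes rows of (document, label, score) like those in the table above, with the document text abbreviated; `triage` and the 0.9 cutoff are hypothetical choices, not part of the TLM API.

```python
from typing import List, Tuple

Row = Tuple[str, str, float]  # (document, label, trustworthiness score)


def triage(rows: List[Row], threshold: float = 0.9) -> Tuple[List[Row], List[Row]]:
    """Split labeled rows into auto-accepted rows and rows needing human review."""
    accepted = [row for row in rows if row[2] >= threshold]
    review = [row for row in rows if row[2] < threshold]
    # Put the least trustworthy labels at the front of the review queue.
    review.sort(key=lambda row: row[2])
    return accepted, review


rows = [
    ("All medical health records will be accessed one way only. ...", "HIPAA", 0.984),
    ("TechTarget's Cookies Policy includes the following terminology: ...", "FERPA", 0.426),
]
accepted, review = triage(rows)
print([label for _, label, _ in accepted])
print([label for _, label, _ in review])
```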

Data extraction

TLM can also be used for open-domain data extraction. Our TLM information extraction tutorial walks through an example use case of extracting key information from electronics parts datasheet PDFs.

If you were populating a parts catalog, you might be interested in extracting information like operating voltage from such documents, where TLM’s trustworthiness scores can separate good outputs from bad:

part	operating voltage	trustworthiness
ATtiny44A	1.8 - 5.5V	0.937
⋮	⋮	⋮
ZRE200GE	1V - 15V DC	0.567

… and more

The examples above just scratch the surface of reliable AI applications that become possible with TLM. We’re continually adding hands-on tutorials for new applications of TLM.

Evaluating TLM Performance

We evaluate TLM’s ability to add trust to arbitrary LLMs by benchmarking TLM against OpenAI’s state-of-the-art GPT-4 LLM. Our comprehensive benchmarks investigate two questions to evaluate the reliability of TLM’s (1) responses, and (2) trustworthiness scores:

How accurate are TLM responses compared to the baseline LLM?
To meet a required error rate by flagging low-scoring LLM responses for human review, how much cost/time does a team save by scoring responses via TLM vs. existing confidence estimation approaches?

The second item can be rephrased as: How many wrong LLM responses can we catch under a limited review budget by prioritizing via trustworthiness scores? When investigating this, we compare against two popular approaches to estimate the confidence of the baseline LLM:

Self-Eval: Asking the LLM to evaluate its own output and rate its confidence on a scale of 1-5. This is done in a subsequent request to the model (details in Appendix).
Probability: Relying on the probability of the generated output given by the language model, as recommended by OpenAI. We use the average log probability of tokens in the LLM response, obtained from the raw output of the underlying autoregressive neural network. This quantity is closely tied to what AI research calls perplexity (the exponential of the negative average log probability).
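For concreteness, here is a small sketch of the Probability score computed from per-token log probabilities, together with the standard definition of perplexity as the exponential of the negative average log probability. The function names and the two-token example are hypothetical; real token log probabilities would come from your LLM provider’s API.

```python
import math
from typing import List


def average_log_probability(token_logprobs: List[float]) -> float:
    """Mean of the per-token log probabilities of a response."""
    if not token_logprobs:
        raise ValueError("response has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity: exp of the negative average log probability."""
    return math.exp(-average_log_probability(token_logprobs))


# Made-up response with two tokens of probability 0.5 and 0.25.
logprobs = [math.log(0.5), math.log(0.25)]
avg = average_log_probability(logprobs)
ppl = perplexity(logprobs)
print(avg, ppl)
```

A higher average log probability (equivalently, a lower perplexity) means the model assigned more probability to its own output.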

Both of these confidence measures merely quantify the aleatoric uncertainty (known unknowns) in model predictions. This is uncertainty the model is aware of due to a known challenging prompt (e.g., incomplete/vague request). TLM’s trustworthiness score additionally quantifies epistemic uncertainty (unknown unknowns), which arises when the model was not previously trained on data similar to a given request.

Benchmark datasets

Our study focuses on Q&A settings. Unlike other LLM benchmarks, we never measure benchmark performance using LLM-based evaluations. All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. We consider these popular Q&A datasets:

TriviaQA: Open-domain trivia questions.
ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
SVAMP: Elementary-level math word problems.
GSM8k: Grade school math problems.
Diagnosis: Diagnosing medical conditions based on symptom descriptions from the patient.

The next sections show some benchmark examples and the corresponding TLM outputs.

Examples from benchmark where TLM responded correctly

Prompt: If 6 potatoes makes 36 hash browns, how many hash browns can you make out of 96 potatoes?

TLM Output: 576 Trustworthiness Score: 0.993

Prompt: You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them. The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a feeling of food or acid backing up into my throat. I have chest pain which gets worse if I lie down. I get frequent heartburn or indigestion, after eating food and vomit it out.

TLM Output: gastroesophageal reflux disease Trustworthiness Score: 0.994

Examples from benchmark where TLM responded incorrectly

Prompt: Emil is 19 years old now. When he turns 24, he will be half the age of his dad but twice as old as his brother. What is the sum of the ages of his dad and his brother now?

TLM Output: 65 Trustworthiness Score: 0.123 (Ground-Truth Answer: 50)

Prompt: On a standard dartboard, which number lies opposite number 4?

TLM Output: 18 Trustworthiness Score: 0.379 (Ground-Truth Answer: 16)

Prompt: You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them. The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a severe headache that feels like pressure in my head. I also have a mild fever and small red spots on my back.

TLM Output: migraine Trustworthiness Score: 0.221 (Ground-Truth Answer: dengue)

Benchmark Results

The following table reports the accuracy of responses from TLM and GPT-4 across each benchmark dataset:

Dataset	OpenAI GPT-4 API	Cleanlab TLM API
TriviaQA	84.7%	84.8%
ARC	94.6%	94.9%
SVAMP	90.7%	91.7%
GSM8k	46.5%	55.6%
Diagnosis	67.4%	68.0%

TLM consistently improves the accuracy of the baseline GPT-4 LLM across all datasets.

Next, we evaluate the three aforementioned approaches to estimate trustworthiness scores for each LLM response (again using GPT-4 as the baseline LLM): TLM, Self-Eval, Probability. The following plot reports the error rate of LLM responses amongst the top-K% of responses with the highest trustworthiness scores in each dataset:

Across all datasets, TLM trustworthiness scores allow us to more reliably detect bad LLM responses than the Self-Eval or Probability scores. If a team has to ensure a max-acceptable error rate by manually reviewing the low-scoring LLM responses, enormous reviewing costs/time can be saved by adopting TLM scores. For instance, a team could achieve near-zero error rates for the SVAMP dataset by only inspecting ~20% of the LLM responses when relying on TLM trustworthiness, but would have to inspect nearly 40% or 90% of the data when relying on Probability or Self-Eval scores, respectively.
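If you want to run the same kind of analysis on your own data, the error rate among the top-K% highest-scoring responses can be computed along these lines. This is a sketch under simple assumptions: `error_rate_top_k` is a hypothetical helper, the number of kept responses is rounded to the nearest count, and the toy scores and correctness flags are made up.

```python
from typing import Sequence


def error_rate_top_k(scores: Sequence[float], correct: Sequence[bool], k_percent: float) -> float:
    """Error rate among the k_percent of responses with the highest scores."""
    if len(scores) != len(correct) or len(scores) == 0:
        raise ValueError("scores and correct must be non-empty and the same length")
    if not 0 < k_percent <= 100:
        raise ValueError("k_percent must be in (0, 100]")
    # Indices sorted from most to least trustworthy.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_keep = max(1, round(len(scores) * k_percent / 100))
    kept = order[:n_keep]
    errors = sum(1 for i in kept if not correct[i])
    return errors / n_keep


# Toy data: five responses, two of them wrong.
scores = [0.9, 0.8, 0.3, 0.95, 0.2]
correct = [True, True, False, True, False]
print(error_rate_top_k(scores, correct, 60))
print(error_rate_top_k(scores, correct, 100))
```

The responses outside the top K% are the ones a team would send to human review, so a curve that drops quickly as K shrinks means fewer reviews are needed to hit a target error rate.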

We additionally evaluate the utility of these trustworthiness scores via: the probability that LLM response #1 receives a higher trustworthiness score than LLM response #2, where the former is randomly selected from the subset of model responses that were correct, and the latter from the subset of model responses that were incorrect. Widely used to assess diagnostic scores, this evaluation metric is known as the Area under the Receiver Operating Characteristic Curve (AUROC). The following table reports the AUROC achieved by each trustworthiness scoring method in each dataset:

Dataset	Probability	Self-Eval	TLM
TriviaQA	0.704	0.623	0.812
ARC	0.755	0.659	0.861
SVAMP	0.943	0.793	0.973
GSM8k	0.883	0.868	0.994
Diagnosis	0.614	0.654	0.711
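The pairwise definition of AUROC given above translates directly into code. The sketch below compares every (correct, incorrect) pair of responses and counts ties as half a win, a common convention we assume here; `pairwise_auroc` is a hypothetical helper and the toy data is made up. For large datasets a library routine would be faster than this quadratic loop.

```python
from typing import Sequence


def pairwise_auroc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a random correct response outscores a random incorrect one."""
    right = [s for s, c in zip(scores, correct) if c]
    wrong = [s for s, c in zip(scores, correct) if not c]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect response")
    wins = 0.0
    for r in right:
        for w in wrong:
            if r > w:
                wins += 1.0
            elif r == w:
                wins += 0.5  # ties count as half a win
    return wins / (len(right) * len(wrong))


# Toy data: three correct and two incorrect responses.
scores = [0.9, 0.8, 0.3, 0.6, 0.7]
correct = [True, True, False, True, False]
auroc = pairwise_auroc(scores, correct)
print(auroc)
```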

Additional benchmarks are presented in the Appendix, in particular with other versions of TLM built around GPT-3.5 and GPT-4o instead of GPT-4. The benchmarks reveal that TLM can reduce the error rate (incorrect answers) of GPT-4 by up to 10%, of GPT-4o by up to 27%, and of GPT-3.5 by up to 22%. The trustworthiness estimates output by TLM are significantly more effective for catching bad answers, across different evaluation metrics, datasets, and LLMs.
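To make the error-rate arithmetic concrete, the sketch below assumes these reductions are relative (the drop in error rate divided by the baseline error rate) and applies that to the GPT-4o GSM8k accuracies reported in the Appendix (74.1% for the baseline, 81.2% for TLM). The helper name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(baseline_accuracy: float, tlm_accuracy: float) -> float:
    """Relative drop in error rate, given accuracies in percent."""
    baseline_error = 100.0 - baseline_accuracy
    tlm_error = 100.0 - tlm_accuracy
    if baseline_error <= 0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - tlm_error) / baseline_error


# GPT-4o on GSM8k: error rate falls from 25.9% to 18.8%.
reduction = relative_error_reduction(74.1, 81.2)
print(f"{reduction:.1%}")
```

Under this reading, the GSM8k row for GPT-4o works out to roughly a 27% reduction in errors.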

Conclusion

This article shows how the TLM technology can boost the reliability of any LLM application via trustworthiness scores and more accurate responses. You can use Cleanlab’s TLM built on top of popular LLMs, or contact us to convert your own LLM into a TLM (requires no additional training of the LLM or access to its training data or model weights).

Of course, there’s no free lunch. TLM requires extra computation in order to provide these benefits. It internally calls the underlying LLM multiple times to self-reflect on candidate responses and assess the consistency between candidate responses. Learn more via the documentation. TLM is thus most useful for higher-stakes AI applications that require reliability and no unchecked hallucinations.
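To give a feel for why consistency checks need extra LLM calls, here is a toy agreement measure: sample several responses to the same prompt and see what fraction match the candidate answer. This is purely illustrative and is not TLM’s actual algorithm (see the documentation for that); `agreement_score`, the normalization, and the sampled answers are all hypothetical.

```python
from typing import List


def agreement_score(candidate: str, samples: List[str]) -> float:
    """Fraction of sampled responses matching the candidate after light normalization."""
    if not samples:
        raise ValueError("need at least one sampled response")

    def normalize(text: str) -> str:
        return text.strip().lower()

    target = normalize(candidate)
    matches = sum(1 for s in samples if normalize(s) == target)
    return matches / len(samples)


# Made-up samples: three of four resampled answers agree with the candidate.
samples = ["16", "18", " 16 ", "16"]
score = agreement_score("16", samples)
print(score)
```

Each sampled response costs another call to the underlying LLM, which is where the added computation comes from.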

We will soon launch a web interface to run TLM over big datasets, which inevitably contain edge-cases that cause bad LLM outputs. You’ll be able to catch and remediate these with the trustworthiness score, all with a few clicks.

Resources

Play with Cleanlab’s TLM in our interactive demo.
Try the actual TLM API for free, and run various tutorial use-cases.
Read about TLM in today’s News.
Join our Slack community to discuss reliable AI + follow us on Twitter & LinkedIn.

Appendix


Additional GPT-4 benchmark results.

Here we supplement our AUROC evaluation of various trustworthiness scores’ utility with an additional evaluation metric. When we see a higher trustworthiness score, AUROC intuitively quantifies how much more confident we can truly be that the LLM answer is actually correct. Ideally, we’d like trustworthiness scores near 1 for LLM responses that are correct and near 0 for incorrect responses. However, AUROC does not quantify how different we can expect trustworthiness scores to look for correct vs. incorrect LLM answers.

The table below reports a measure of this separation via the Confidence Gap, defined as the difference between two averages. The first average is taken over the trustworthiness scores for LLM responses that were correct, the latter over the scores for incorrect responses.

Dataset	Probability	Self-Eval	TLM
TriviaQA	0.0714	0.123	0.219
ARC	0.0193	0.23	0.316
SVAMP	0.208	0.566	0.633
GSM8k	0.347	0.707	0.772
Diagnosis	0.029	0.159	0.164
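The Confidence Gap defined above is just a difference of two means, which the following sketch computes. The helper name `confidence_gap` and the toy scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def confidence_gap(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean score of correct responses minus mean score of incorrect responses."""
    right = [s for s, c in zip(scores, correct) if c]
    wrong = [s for s, c in zip(scores, correct) if not c]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect response")
    return mean(right) - mean(wrong)


# Toy data: three correct and two incorrect responses.
scores = [0.9, 0.8, 0.3, 0.6, 0.7]
correct = [True, True, False, True, False]
gap = confidence_gap(scores, correct)
print(gap)
```

A gap near 0 (or negative) means the scores barely distinguish right answers from wrong ones, even if their ranking is decent.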

Benchmark results for GPT-3.5.

To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different baseline LLM model. In this section, we use OpenAI’s GPT-3.5 LLM instead of GPT-4. Our TLM implementation on top of GPT-3.5 solely relies on this baseline LLM. Neither GPT-4 nor any other powerful LLM is used in any of the results presented in this section. See the earlier sections for definitions of each evaluation metric presented here.

The following table reports the accuracy of responses from TLM and GPT-3.5 across each benchmark dataset:

Dataset	GPT-3.5	TLM
TriviaQA	73.0%	75.2%
ARC	82.2%	85.6%
SVAMP	79.6%	84.5%
GSM8k	68.8%	76.7%
Diagnosis	58.3%	58.6%

The following table reports the AUROC achieved by each trustworthiness scoring method when applied with GPT-3.5:

Dataset	Probability	Self-Eval	TLM
TriviaQA	0.648	0.617	0.837
ARC	0.739	0.604	0.902
SVAMP	0.572	0.594	0.886
GSM8k	0.726	0.559	0.773
Diagnosis	0.611	0.570	0.733

The following table reports the Confidence Gap achieved by each trustworthiness scoring method when applied with GPT-3.5:

Dataset	Probability	Self-Eval	TLM
TriviaQA	0.0587	0.171	0.272
ARC	0.0314	0.0882	0.381
SVAMP	0.0424	0.113	0.349
GSM8k	0.0829	0.0794	0.219
Diagnosis	0.025	0.071	0.124

To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-3.5):

Dataset	Probability	Self-Eval	TLM
TriviaQA	76.9%	77.9%	83.5%
ARC	86.1%	84.1%	94.8%
SVAMP	80.8%	82.6%	91.8%
GSM8k	75.9%	72.3%	78.3%
Diagnosis	60.8%	62.3%	64.8%

Benchmark results for GPT-4o.

To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different baseline LLM model. In this section, we use OpenAI’s GPT-4o LLM instead of GPT-4. Our TLM implementation on top of GPT-4o solely relies on this baseline LLM and no other LLM model. See the earlier sections for definitions of each evaluation metric presented here.

The following table reports the accuracy of responses from TLM and GPT-4o across each benchmark dataset:

Dataset	GPT-4o	TLM
TriviaQA	88.2%	89.2%
ARC	96.6%	96.7%
SVAMP	95.0%	95.2%
GSM8k	74.1%	81.2%
Diagnosis	68.9%	69.3%

The following table reports the AUROC achieved by each trustworthiness scoring method when applied with GPT-4o:

Dataset	Probability	Self-Eval	TLM
TriviaQA	0.64	0.574	0.817
ARC	0.815	0.686	0.850
SVAMP	0.612	0.589	0.788
GSM8k	0.424	0.504	0.659
Diagnosis	0.73	0.596	0.722

The following table reports the Confidence Gap achieved by each trustworthiness scoring method when applied with GPT-4o:

Dataset	Probability	Self-Eval	TLM
TriviaQA	0.023	0.122	0.222
ARC	0.060	0.336	0.333
SVAMP	0.019	0.120	0.076
GSM8k	-0.005	0.003	0.020
Diagnosis	0.023	0.103	0.163

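To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-4o):

Dataset	Probability	Self-Eval	TLM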
TriviaQA	89.8%	89.0%	94.8%
ARC	98.7%	97.8%	99.2%
SVAMP	96.3%	95.8%	97.7%
GSM8k	72.8%	74.5%	77.0%
Diagnosis	74.8%	73.6%	75.8%

Additional benchmarking details.

Benchmarks of TLM accuracy were run using the best quality preset for the TLM, which samples multiple candidate responses and returns the one with the highest trustworthiness score. Benchmarks of the TLM trustworthiness score were run using default settings, which do not attempt to improve the response from the baseline LLM and merely score its trustworthiness. In all benchmarks, the TLM never accessed a more powerful LLM than the baseline model being compared against.

The prompt used for the Self-Eval method, sent as a separate request to have the LLM evaluate its previous response, was:

Question: {question}
Answer: {LLM response}
Evaluate how confident you are that the given Answer is a good and accurate response to the Question.
Please assign a Score using the following 5-point scale:
1: You are not confident that the Answer addresses the Question at all, the Answer may be entirely off-topic or irrelevant to the Question.
2: You have low confidence that the Answer addresses the Question, there are doubts and uncertainties about the accuracy of the Answer.
3: You have moderate confidence that the Answer addresses the Question, the Answer seems reasonably accurate and on-topic, but with room for improvement.
4: You have high confidence that the Answer addresses the Question, the Answer provides accurate information that addresses most of the Question.
5: You are extremely confident that the Answer addresses the Question, the Answer is highly accurate, relevant, and effectively addresses the Question in its entirety.
The output should strictly use the following template: Explanation: [provide a brief reasoning you used to derive the rating Score] and then write ‘Score: <rating>’ on the last line.

For this Self-Eval method, we also tried having the LLM report confidence in its original answer on more continuous numeric scales (e.g., 1-10 or 1-100), but the resulting scores performed worse.
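Since the Self-Eval prompt asks the model to end its reply with a ‘Score: <rating>’ line, the rating can be pulled out with a small parser like the one sketched below. The function names are hypothetical, and the linear mapping of the 1-5 rating onto [0, 1] is an assumed normalization for comparing against other scores, not something specified in our benchmark setup.

```python
import re
from typing import Optional


def parse_self_eval_score(reply: str) -> Optional[int]:
    """Extract the 1-5 rating from the final 'Score: <rating>' line, or None if absent."""
    lines = [line.strip() for line in reply.strip().splitlines() if line.strip()]
    if not lines:
        return None
    match = re.fullmatch(r"Score:\s*([1-5])", lines[-1])
    return int(match.group(1)) if match else None


def to_unit_interval(rating: int) -> float:
    """Map a 1-5 rating linearly onto [0, 1] (an assumed normalization)."""
    return (rating - 1) / 4


reply = "Explanation: The answer matches the question.\nScore: 4"
rating = parse_self_eval_score(reply)
print(rating, to_unit_interval(rating))
```

Returning `None` for malformed replies lets the caller decide whether to retry the request or treat the response as untrusted.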

Prompts used for each benchmark dataset.

Throughout all benchmarks, TLM and the baseline LLM are prompted using the exact same prompts. In competitive AI research, these types of benchmarks are run with sophisticated many-shot chain-of-thought prompts to maximize raw LLM accuracy (example). While such complex prompting helps foundation model providers show their new model is better than everybody else’s, it does not reflect the types of queries from typical users. Our benchmarks here use simple prompts to better reflect how LLMs are used to drive real-world business value. We are not focused on optimizing prompts to maximize LLM accuracy, and instead focus on studying the benefits of adding the TLM technology to any LLM.

The specific prompts we used to run our LLMs on each dataset are listed below.

TriviaQA:

{question text}
Therefore, the answer is

ARC:

For GPT-3.5 and GPT-4:

{multiple-choice question text}
Therefore, among A through D, the answer is:

For GPT-4o (for which the above prompt produced overly lengthy answers):

{multiple-choice question text}
Please restrict your answer to one letter from A to D and nothing else.

SVAMP:

{question text}
Therefore, the answer (arabic numerals) is:

GSM8k:

{question text}
Therefore, the answer (arabic numerals) is:

Diagnosis:

You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them. The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease]. Consider why the Symptoms reflect a specific Diagnosis. In your response, respond with only a single Diagnosis out of the list. Do not write anything else.
Symptoms: {text from patient description}

Removing bad data from the benchmark datasets.

When studying initial benchmark results, we observed multiple examples where the TLM output had a very high trustworthiness score but did not match the correct answer listed in the benchmark dataset. Upon closer inspection, most such examples actually had incorrect answers in the benchmark dataset, which can mislead AI research.

Thanks to TLM’s trustworthiness scores, we were able to catch these incorrect answers (each was manually verified) and remove them from our benchmark. The bad data we removed from the benchmarks is shared on Hugging Face.

Example error found in the GSM8K dataset:

Question: After scoring 14 points, Erin now has three times more points than Sara, who scored 8. How many points did Erin have before?

Answer According to the Dataset: 18

TLM Trustworthiness Score for this Answer: 0.000961

(Actual Answer we determined: 10)

Example error found in the SVAMP dataset:

Question: Rachel’s tree had 4 apples. She picked 2 apples from her tree. Thereafter 3 new apples grew on the tree. How many apples are there on the tree now?

Answer According to the Dataset: 1

TLM Trustworthiness Score for this Answer: 0.001508

(Actual Answer we determined: 5)
