Automatically detecting LLM hallucinations with models like GPT-4o and Claude

September 4, 2024
  • Hui Wen Goh
  • Jay Zhang
  • Ulyana Tkachenko
  • Jonas Mueller

The Trustworthy Language Model (TLM) is a system for reliable AI that adds a trustworthiness score to every LLM response. Compatible with any base LLM model, TLM now comes with out-of-the-box support for new models from OpenAI and Anthropic, including GPT-4o, GPT-4o mini, and Claude 3 Haiku. This article shares comprehensive benchmarks comparing the hallucination detection performance of TLM against other hallucination-scoring strategies that use these same models:

  • The Self-Eval strategy asks the LLM to rate its own output in an additional request.
  • The Probability strategy aggregates the token probabilities output by the model (token probabilities are not available for all LLMs, for instance those served by Anthropic or AWS Bedrock).
| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 89.8%       | 89.0%     | 94.8% |
| ARC       | 98.7%       | 97.8%     | 99.2% |
| SVAMP     | 96.3%       | 95.8%     | 97.7% |
| GSM8k     | 72.8%       | 74.5%     | 77.0% |
| Diagnosis | 74.8%       | 73.6%     | 75.8% |

The table above reports results obtained using OpenAI’s GPT-4o model with trustworthiness scores computed via one of three strategies: TLM, Self-Eval, or Probability. Each row lists the accuracy of LLM responses over 80% of the corresponding dataset, specifically the subset of examples that received the highest trustworthiness scores.

One framework for reliable AI is to have the system abstain from responding when estimated trustworthiness is too low, particularly in human-in-the-loop workflows where we only want LLMs to automate the subset of tasks that they can reliably handle. The results above show that trustworthiness scores from TLM yield significantly more reliable AI in such applications. For instance: TLM enables your team to ensure < 1% error rates over the ARC dataset while reviewing < 20% of the data, savings that are not achievable via other scoring techniques.
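As a concrete illustration of this abstention workflow, here is a minimal sketch (not TLM's API): compare each response's trustworthiness score, from whichever scoring method you use, against a threshold tuned on held-out data, and escalate low-scoring responses for human review. The `route_responses` helper below is hypothetical.

```python
# Minimal sketch of a threshold-based abstention policy (hypothetical helper,
# not part of any particular library). Given (response, trustworthiness_score)
# pairs from any scoring method, route low-scoring responses to human review.

def route_responses(scored_responses, threshold=0.8):
    """Split responses into those safe to automate vs. those needing human review."""
    automated, needs_review = [], []
    for response, score in scored_responses:
        (automated if score >= threshold else needs_review).append((response, score))
    return automated, needs_review

# Example: with a threshold tuned on held-out data, only a small fraction of items
# gets escalated while the automated subset maintains a low error rate.
auto, review = route_responses([("Paris", 0.97), ("42", 0.31), ("dengue", 0.88)])
```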

Below, we present additional benchmarks, while keeping this update concise. To learn more about TLM and these benchmarks, refer to our original blogpost, which provides all of the details and many additional results.

Hallucination Detection Benchmark

All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. This is unlike other LLM benchmarks that rely on noisy LLM-based evaluations. When presenting results for each specific base LLM model, only that model is used to produce responses and evaluate their trustworthiness – no other LLM is involved. All prompting and usage of the base LLM remains identical across all strategies studied here.

Each of the considered hallucination detection strategies produces a score for every LLM response. For instance, the Probability strategy scores responses by their average log token probability (equivalently, the negative log of the response's perplexity). The Self-Eval strategy asks the LLM to rate its confidence in its answer on a 1-5 Likert scale using Chain-of-Thought prompting.
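For reference, the Probability baseline can be computed directly from per-token log probabilities returned by the LLM API. Below is a minimal sketch, assuming the OpenAI Python SDK (openai>=1.x) and its `logprobs=True` option; this illustrates the baseline only, not TLM's internal implementation.

```python
# Probability baseline: aggregate per-token log probabilities of the response into
# a single confidence score. Assumes the OpenAI Python SDK (openai>=1.x), which
# returns per-token log probabilities when logprobs=True is requested.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def probability_score(question: str, model: str = "gpt-4o") -> tuple[str, float]:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = completion.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    # exp(average log-probability) lies in (0, 1]; higher values mean the model
    # assigned higher likelihood to its own response (equivalently, 1 / perplexity).
    return choice.message.content, math.exp(avg_logprob)
```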

Prompt used for Self-Eval strategy:

Question: {question}
Answer: {LLM response}
Evaluate how confident you are that the given Answer is a good and accurate response to the Question.
Please assign a Score using the following 5-point scale:
1: You are not confident that the Answer addresses the Question at all, the Answer may be entirely off-topic or irrelevant to the Question.
2: You have low confidence that the Answer addresses the Question, there are doubts and uncertainties about the accuracy of the Answer.
3: You have moderate confidence that the Answer addresses the Question, the Answer seems reasonably accurate and on-topic, but with room for improvement.
4: You have high confidence that the Answer addresses the Question, the Answer provides accurate information that addresses most of the Question.
5: You are extremely confident that the Answer addresses the Question, the Answer is highly accurate, relevant, and effectively addresses the Question in its entirety.
The output should strictly use the following template: Explanation: [provide a brief reasoning you used to derive the rating Score] and then write ‘Score: <rating>’ on the last line.

We also tried other prompt variants of the Self-Eval method, as well as having the LLM report confidence in its original answer on finer-grained numeric scales (e.g., 1-10 or 1-100), but the resulting hallucination scores performed worse.
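To make the Self-Eval strategy concrete, here is a minimal sketch of how the prompt above can be turned into a numeric score; the `ask_llm` argument is a hypothetical stand-in for a call to whichever base LLM is being scored.

```python
# Sketch of the Self-Eval strategy: send the prompt shown above to the LLM, then
# parse the trailing "Score: <rating>" line into a score between 0 and 1.
# `ask_llm` is a hypothetical stand-in for a call to the base LLM being scored.
import re

SELF_EVAL_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {llm_response}\n"
    "Evaluate how confident you are that the given Answer is a good and accurate "
    "response to the Question.\n"
    "Please assign a Score using the following 5-point scale:\n"
    "...\n"  # full 1-5 rubric from the prompt shown above, omitted here for brevity
    "The output should strictly use the following template: Explanation: [provide a "
    "brief reasoning you used to derive the rating Score] and then write "
    "'Score: <rating>' on the last line."
)

def self_eval_score(question: str, llm_response: str, ask_llm) -> float:
    rating_text = ask_llm(SELF_EVAL_TEMPLATE.format(question=question, llm_response=llm_response))
    match = re.search(r"Score:\s*([1-5])", rating_text)
    if match is None:
        return 0.0  # treat unparseable ratings as untrustworthy
    return (int(match.group(1)) - 1) / 4  # map the 1-5 Likert rating onto [0, 1]
```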

Evaluating Performance

To quantify the effectiveness of each strategy using the ground-truth answers in our benchmark, we primarily consider: how many wrong LLM responses can we catch under a limited review budget by prioritizing reviews via trustworthiness scores?

This is evaluated in two ways, with higher values indicating better performance (a code sketch of both metrics appears after this list):

  • LLM response accuracy over only the 80% of responses in each dataset that received the highest trustworthiness scores (reported in the table above for GPT-4o).
  • Precision/recall for detecting incorrect LLM responses, measured using the Area under the Receiver Operating Characteristic Curve (AUROC).
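Both metrics are straightforward to compute from per-example correctness labels and trustworthiness scores; here is a minimal sketch, assuming numpy and scikit-learn are available.

```python
# Evaluation metrics sketch: accuracy over the most-trusted subset, and AUROC for
# detecting incorrect responses. Assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.metrics import roc_auc_score

def accuracy_at_coverage(is_correct, trust_scores, coverage=0.8):
    """LLM response accuracy over the `coverage` fraction of examples (e.g. the
    top 80%) that received the highest trustworthiness scores."""
    is_correct = np.asarray(is_correct, dtype=bool)
    top_idx = np.argsort(-np.asarray(trust_scores, dtype=float))  # highest scores first
    keep = int(round(coverage * len(top_idx)))
    return is_correct[top_idx[:keep]].mean()

def error_detection_auroc(is_correct, trust_scores):
    """AUROC for detecting incorrect responses: errors should receive low
    trustworthiness scores, so each example is ranked by its negated score."""
    is_error = ~np.asarray(is_correct, dtype=bool)
    return roc_auc_score(is_error, -np.asarray(trust_scores, dtype=float))
```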

Datasets

Our study considers popular Q&A datasets:

  • TriviaQA: Open-domain trivia questions.
  • ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
  • SVAMP: Elementary-level math word problems.
  • GSM8k: Grade school math problems.
  • Diagnosis: Classifying medical conditions based on symptom descriptions from patients.
Examples from benchmark where LLM responses are correct/wrong.

Examples from benchmark where LLM responses are correct

Prompt: If 6 potatoes makes 36 hash browns, how many hash browns can you make out of 96 potatoes?

LLM Response: 576 | TLM Trustworthiness Score: 0.993

Prompt: You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them. The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a feeling of food or acid backing up into my throat. I have chest pain which gets worse if I lie down. I get frequent heartburn or indigestion, after eating food and vomit it out.

LLM Response: gastroesophageal reflux disease | TLM Trustworthiness Score: 0.994

Examples from benchmark where LLM responses are wrong

Prompt: Emil is 19 years old now. When he turns 24, he will be half the age of his dad but twice as old as his brother. What is the sum of the ages of his dad and his brother now?

LLM Response: 65 | TLM Trustworthiness Score: 0.123 | Ground-Truth Answer: 50

Prompt: On a standard dartboard, which number lies opposite number 4?

LLM Response: 18 | TLM Trustworthiness Score: 0.379 | Ground-Truth Answer: 16

Prompt: You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them. The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a severe headache that feels like pressure in my head. I also have a mild fever and small red spots on my back.

LLM Response: migraine | TLM Trustworthiness Score: 0.221 | Ground-Truth Answer: dengue

Additional results for GPT-4o

The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o:

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 0.64        | 0.574     | 0.817 |
| ARC       | 0.815       | 0.686     | 0.850 |
| SVAMP     | 0.612       | 0.589     | 0.788 |
| GSM8k     | 0.424       | 0.504     | 0.659 |
| Diagnosis | 0.73        | 0.596     | 0.722 |

Benchmark results for GPT-4o mini

To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different base LLM model. In this section, we use OpenAI’s cheaper/faster GPT-4o mini LLM instead of GPT-4o. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o mini:

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 0.715       | 0.678     | 0.809 |
| ARC       | 0.754       | 0.719     | 0.867 |
| SVAMP     | 0.863       | 0.838     | 0.933 |
| GSM8k     | 0.729       | 0.886     | 0.913 |
| Diagnosis | 0.668       | 0.618     | 0.697 |

To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-4o mini):

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 84%         | 81.5%     | 87.5% |
| ARC       | 95.7%       | 95.4%     | 97.7% |
| SVAMP     | 93.3%       | 96.4%     | 97.4% |
| GSM8k     | 79.6%       | 84.2%     | 82.1% |
| Diagnosis | 67.7%       | 69%       | 73.1% |

Benchmark results for Claude 3 Haiku

We also repeat our study with another base LLM model: Anthropic’s Claude 3 Haiku. Here we do not benchmark against the Probability strategy, as Anthropic does not provide token probabilities from their model. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with Claude 3 Haiku:

| Dataset   | Self-Eval | TLM   |
|-----------|-----------|-------|
| TriviaQA  | 0.551     | 0.775 |
| ARC       | 0.518     | 0.619 |
| SVAMP     | 0.512     | 0.855 |
| GSM8k     | 0.479     | 0.745 |
| Diagnosis | 0.538     | 0.679 |

To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on Claude 3 Haiku):

| Dataset   | Self-Eval | TLM   |
|-----------|-----------|-------|
| TriviaQA  | 76%       | 82.3% |
| ARC       | 85.2%     | 87.1% |
| SVAMP     | 93.1%     | 96.9% |
| GSM8k     | 86.8%     | 92.1% |
| Diagnosis | 58.1%     | 64%   |

Discussion

Across nearly every dataset and base LLM model, TLM trustworthiness scores detect bad LLM responses with higher precision/recall than the Self-Eval or Probability scores. The latter strategies merely quantify limited forms of model uncertainty, whereas TLM is a universal uncertainty-quantification framework designed to catch many different kinds of untrustworthy responses.

Use TLM to mitigate unchecked hallucination in any application – ideally with whichever base LLM model produces the best responses in your setting. Although current LLMs remain fundamentally unreliable, you now have a framework for delivering trustworthy AI!

Next Steps

  • Get started with the TLM API and run through various tutorials. Specify which base LLM model to use via the TLMOptions argument – all of the models listed in this article (and more) are supported out-of-the-box (a minimal usage sketch appears after this list).

  • Demo TLM through our interactive playground.

  • This article showcased the generality of TLM across various base LLM models that our public API provides out-of-the-box. If you’d like a (private) version of TLM based on your own custom LLM, get in touch!

  • Learn more about TLM, and refer to our original blogpost for additional results and benchmarking details.
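Here is the minimal usage sketch referenced above; the exact client names and TLMOptions fields are assumptions based on the current Python client, so consult the TLM tutorials for the authoritative API.

```python
# Minimal usage sketch (assumed API surface; see the TLM tutorials for the
# authoritative client names and TLMOptions fields).
from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")
tlm = studio.TLM(options={"model": "gpt-4o"})  # assumption: base LLM selected via TLMOptions

out = tlm.prompt("On a standard dartboard, which number lies opposite number 4?")
print(out["response"], out["trustworthiness_score"])  # response plus its trustworthiness score
```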
