Automatically detecting LLM hallucinations with models like GPT-4o and Claude

September 4, 2024
  • Hui Wen Goh
  • Jay Zhang
  • Ulyana Tkachenko
  • Jonas Mueller

The Trustworthy Language Model (TLM) is a system for reliable AI that adds a trustworthiness score to every LLM response. Compatible with any base LLM, TLM now comes with out-of-the-box support for new models from OpenAI and Anthropic, including GPT-4o, GPT-4o mini, and Claude 3 Haiku.
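To make this concrete, here is a minimal sketch of scoring a single response with TLM via the cleanlab_studio Python client. The client calls and the model-name string are assumptions based on the public TLM docs; consult the API tutorials linked at the end of this post for the exact usage.

```python
from cleanlab_studio import Studio  # assumed public TLM client

studio = Studio("<YOUR_API_KEY>")
# The "model" option selects the base LLM; the identifier string is illustrative.
tlm = studio.TLM(options={"model": "gpt-4o"})

out = tlm.prompt("What year did the first human land on the Moon?")
print(out["response"])               # the base LLM's answer
print(out["trustworthiness_score"])  # score in [0, 1]; higher means more trustworthy
```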

Overview of trustworthiness scoring

This article shares comprehensive benchmarks comparing the hallucination-detection performance of TLM against other hallucination-scoring strategies applied with these same models:

  • The Self-Eval strategy asks the LLM to rate its own output in an additional request.
  • The Probability strategy aggregates the token probabilities output by the model (this is not available for all LLMs; for instance, models from Anthropic or AWS Bedrock do not expose token probabilities).
| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 89.8%       | 89.0%     | 94.8% |
| ARC       | 98.7%       | 97.8%     | 99.2% |
| SVAMP     | 96.3%       | 95.8%     | 97.7% |
| GSM8k     | 72.8%       | 74.5%     | 77.0% |
| Diagnosis | 74.8%       | 73.6%     | 75.8% |

The table above reports results obtained using OpenAI’s GPT-4o model with trustworthiness scores computed via one of three strategies: TLM, Self-Eval, or Probability. Each row lists the accuracy of LLM responses over the 80% of the corresponding dataset that received the highest trustworthiness scores.

One framework for reliable AI is to have the system abstain from responding when estimated trustworthiness is too low, particularly in human-in-the-loop workflows where we only want LLMs to automate the subset of tasks that they can reliably handle. The results above show that trustworthiness scores from TLM yield significantly more reliable AI in such applications. For instance: TLM enables your team to ensure < 1% error rates over the ARC dataset while reviewing < 20% of the data, savings that are not achievable via other scoring techniques.
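As a sketch of this abstention pattern (the threshold value and helper name below are hypothetical; the right cutoff depends on your target error rate and should be tuned on held-out data):

```python
TRUST_THRESHOLD = 0.8  # hypothetical cutoff; tune on a validation set for your target error rate

def answer_or_escalate(tlm, prompt: str) -> dict:
    """Auto-respond only when TLM deems the answer trustworthy; otherwise route to a human."""
    out = tlm.prompt(prompt)  # returns the response along with its trustworthiness score
    if out["trustworthiness_score"] >= TRUST_THRESHOLD:
        return {"answer": out["response"], "needs_human_review": False}
    return {"answer": None, "needs_human_review": True, "draft_response": out["response"]}
```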

Below, we present additional benchmarks, while keeping this update concise. To learn more about TLM and these benchmarks, refer to our original blogpost, which provides all of the details and many additional results.

Hallucination Detection Benchmark

All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. This is unlike other LLM benchmarks that rely on noisy LLM-based evaluations. When presenting results for each specific base LLM model, only that model is used to produce responses and evaluate their trustworthiness – no other LLM is involved. All prompting and usage of the base LLM remains identical across all strategies studied here.

Each of the considered hallucination-detection strategies produces a score for every LLM response. For instance, the Probability strategy scores responses via the average log-probability of the generated tokens (equivalently, a monotonic transform of perplexity). The Self-Eval strategy asks the LLM to rate its confidence in its own answer on a 1-5 Likert scale, using chain-of-thought prompting.
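For illustration, here is roughly how these two baseline scores can be computed with the OpenAI Python client. This is only a sketch: the exact prompts and aggregation used in our benchmark may differ, and the function names here are ours.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def probability_score(prompt: str, model: str = "gpt-4o") -> tuple[str, float]:
    """Probability baseline: aggregate the token probabilities of the generated response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    answer = resp.choices[0].message.content
    logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    # Geometric-mean token probability (a monotonic transform of perplexity), in [0, 1].
    return answer, math.exp(sum(logprobs) / len(logprobs))

def self_eval_score(prompt: str, answer: str, model: str = "gpt-4o") -> str:
    """Self-Eval baseline: ask the same LLM to rate its own answer on a 1-5 scale."""
    eval_prompt = (
        f"Question: {prompt}\nProposed answer: {answer}\n"
        "Think step by step, then rate how confident you are that the answer is correct "
        "on a scale of 1 (likely wrong) to 5 (certainly correct). End with 'Rating: <1-5>'."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": eval_prompt}]
    )
    return resp.choices[0].message.content  # parse the trailing rating in practice
```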

Evaluating Performance

To quantify the effectiveness of each strategy using the ground-truth in our benchmark, we primarily consider: How many wrong LLM responses can we catch under a limited review budget by prioritizing via trustworthiness scores?

This is evaluated in two ways (with higher values indicating better performance), as sketched in code after this list:

  • LLM response accuracy over only the 80% of responses in each dataset that received the highest trustworthiness scores (reported in the table above for GPT-4o).
  • Precision/recall for detecting incorrect LLM responses, measured using the Area under the Receiver Operating Characteristic Curve (AUROC).
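Both metrics can be computed from per-response correctness labels and trustworthiness scores, for example as follows (a sketch using scikit-learn; exact tie-handling in our benchmark may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_scores(is_correct: np.ndarray, trust_scores: np.ndarray, keep_frac: float = 0.8) -> dict:
    """is_correct: boolean array per LLM response; trust_scores: trustworthiness score per response."""
    # Metric 1: accuracy over the keep_frac of responses with the highest trustworthiness scores.
    cutoff = np.quantile(trust_scores, 1 - keep_frac)
    kept = trust_scores >= cutoff
    accuracy_at_keep = is_correct[kept].mean()
    # Metric 2: AUROC for detecting incorrect responses (lower trust scores should flag errors).
    auroc = roc_auc_score(~is_correct, -trust_scores)
    return {"accuracy_top_80pct": accuracy_at_keep, "auroc": auroc}
```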

Datasets

Our study considers popular Q&A datasets:

  • TriviaQA: Open-domain trivia questions.
  • ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
  • SVAMP: Elementary-level math word problems.
  • GSM8k: Grade school math problems.
  • Diagnosis: Classifying medical conditions based on symptom descriptions from patients.

Additional results for GPT-4o

The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o:

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 0.64        | 0.574     | 0.817 |
| ARC       | 0.815       | 0.686     | 0.850 |
| SVAMP     | 0.612       | 0.589     | 0.788 |
| GSM8k     | 0.424       | 0.504     | 0.659 |
| Diagnosis | 0.73        | 0.596     | 0.722 |

Benchmark results for GPT-4o mini

To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different base LLM model. In this section, we use OpenAI’s cheaper/faster GPT-4o mini LLM instead of GPT-4o. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o mini:

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 0.715       | 0.678     | 0.809 |
| ARC       | 0.754       | 0.719     | 0.867 |
| SVAMP     | 0.863       | 0.838     | 0.933 |
| GSM8k     | 0.729       | 0.886     | 0.913 |
| Diagnosis | 0.668       | 0.618     | 0.697 |

To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-4o mini):

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 84%         | 81.5%     | 87.5% |
| ARC       | 95.7%       | 95.4%     | 97.7% |
| SVAMP     | 93.3%       | 96.4%     | 97.4% |
| GSM8k     | 79.6%       | 84.2%     | 82.1% |
| Diagnosis | 67.7%       | 69%       | 73.1% |

Benchmark results for Claude 3 Haiku

We also repeat our study with another base LLM model: Anthropic’s Claude 3 Haiku. Here we do not benchmark against the Probability strategy, as Anthropic does not provide token probabilities from their model. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with Claude 3 Haiku:

| Dataset   | Self-Eval | TLM   |
|-----------|-----------|-------|
| TriviaQA  | 0.551     | 0.775 |
| ARC       | 0.518     | 0.619 |
| SVAMP     | 0.512     | 0.855 |
| GSM8k     | 0.479     | 0.745 |
| Diagnosis | 0.538     | 0.679 |

To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on Claude 3 Haiku):

| Dataset   | Self-Eval | TLM   |
|-----------|-----------|-------|
| TriviaQA  | 76%       | 82.3% |
| ARC       | 85.2%     | 87.1% |
| SVAMP     | 93.1%     | 96.9% |
| GSM8k     | 86.8%     | 92.1% |
| Diagnosis | 58.1%     | 64%   |

Discussion

Across all datasets and LLM models, TLM trustworthiness scores consistently detect bad LLM responses with higher precision/recall than the Self-Eval or Probability scores. The latter strategies merely quantify limited forms of model uncertainty, whereas TLM is a universal uncertainty-quantification framework to catch all sorts of untrustworthy responses.

Use TLM to mitigate unchecked hallucination in any application – ideally with whichever base LLM model produces the best responses in your setting. Although current LLMs remain fundamentally unreliable, you now have a framework for delivering trustworthy AI!

Next Steps

  • Get started with the TLM API and run through various tutorials. Specify which base LLM model to use via the TLMOptions argument – all of the models listed in this article (and more) are supported out-of-the-box (see the sketch after this list).

  • Chat with TLM (free).

  • This article showcased the generality of TLM across various base LLM models that our public API provides out-of-the-box. If you’d like a (private) version of TLM based on your own custom LLM, get in touch!

  • Learn more about TLM, and refer to our original blogpost for additional results and benchmarking details.
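As a quick sketch of swapping the base LLM via the options argument (the client interface and model identifier strings here are assumptions; check the TLMOptions documentation for the exact supported names):

```python
from cleanlab_studio import Studio  # assumed public TLM client

studio = Studio("<YOUR_API_KEY>")

# Compare trustworthiness scoring across base LLMs; model names below are illustrative.
for model_name in ["gpt-4o", "gpt-4o-mini", "claude-3-haiku"]:
    tlm = studio.TLM(options={"model": model_name})  # TLMOptions-style dict
    out = tlm.prompt("Which planet has the most moons?")
    print(model_name, round(out["trustworthiness_score"], 3))
```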

Related Blogs

  • Automatically boost the accuracy of any LLM, without changing your prompts or the model: Demonstrating how the Trustworthy Language Model system can produce better responses from a wide variety of LLMs.
  • Prevent Hallucinated Responses from any AI Agent: A case study on a reliable Customer Support Agent built with LangGraph and automated trustworthiness scoring.