Automatically detecting LLM hallucinations with models like GPT-4o and Claude

September 4, 2024
  • Hui Wen Goh
  • Jay Zhang
  • Ulyana Tkachenko
  • Jonas Mueller

The Trustworthy Language Model (TLM) is a system for reliable AI that adds a trustworthiness score to every LLM response. Compatible with any base LLM, TLM now comes with out-of-the-box support for new models from OpenAI and Anthropic, including GPT-4o, GPT-4o mini, and Claude 3 Haiku.
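To make this concrete, here is a minimal sketch of scoring a single response with TLM via the cleanlab_studio Python client. The client calls and the model-name string are assumptions based on the public TLM docs; consult the API tutorials linked at the end of this post for the exact usage.

```python
from cleanlab_studio import Studio  # assumed public TLM client

studio = Studio("<YOUR_API_KEY>")
# The "model" option selects the base LLM; the identifier string is illustrative.
tlm = studio.TLM(options={"model": "gpt-4o"})

out = tlm.prompt("What year did the first human land on the Moon?")
print(out["response"])               # the base LLM's answer
print(out["trustworthiness_score"])  # score in [0, 1]; higher means more trustworthy
```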

Overview of trustworthiness scoring

This article shares comprehensive benchmarks comparing the hallucination-detection performance of TLM against other hallucination-scoring strategies applied with these same models:

  • The Self-Eval strategy asks the LLM to rate its own output in an additional request.
  • The Probability strategy aggregates the token probabilities output by the model (this is not available for all LLMs; for instance, models from Anthropic or AWS Bedrock do not expose token probabilities).
| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 89.8%       | 89.0%     | 94.8% |
| ARC       | 98.7%       | 97.8%     | 99.2% |
| SVAMP     | 96.3%       | 95.8%     | 97.7% |
| GSM8k     | 72.8%       | 74.5%     | 77.0% |
| Diagnosis | 74.8%       | 73.6%     | 75.8% |

The table above reports results obtained using OpenAI’s GPT-4o model with trustworthiness scores computed via one of three strategies: TLM, Self-Eval, or Probability. Each row lists the accuracy of LLM responses over the 80% of the corresponding dataset that received the highest trustworthiness scores.

One framework for reliable AI is to have the system abstain from responding when estimated trustworthiness is too low, particularly in human-in-the-loop workflows where we only want LLMs to automate the subset of tasks that they can reliably handle. The results above show that trustworthiness scores from TLM yield significantly more reliable AI in such applications. For instance: TLM enables your team to ensure < 1% error rates over the ARC dataset while reviewing < 20% of the data, savings that are not achievable via other scoring techniques.
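As a sketch of this abstention pattern (the threshold value and helper name below are hypothetical; the right cutoff depends on your target error rate and should be tuned on held-out data):

```python
TRUST_THRESHOLD = 0.8  # hypothetical cutoff; tune on a validation set for your target error rate

def answer_or_escalate(tlm, prompt: str) -> dict:
    """Auto-respond only when TLM deems the answer trustworthy; otherwise route to a human."""
    out = tlm.prompt(prompt)  # returns the response along with its trustworthiness score
    if out["trustworthiness_score"] >= TRUST_THRESHOLD:
        return {"answer": out["response"], "needs_human_review": False}
    return {"answer": None, "needs_human_review": True, "draft_response": out["response"]}
```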

Below, we present additional benchmarks, while keeping this update concise. To learn more about TLM and these benchmarks, refer to our original blogpost, which provides all of the details and many additional results.

Hallucination Detection Benchmark

All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. This is unlike other LLM benchmarks that rely on noisy LLM-based evaluations. When presenting results for each specific base LLM model, only that model is used to produce responses and evaluate their trustworthiness – no other LLM is involved. All prompting and usage of the base LLM remains identical across all strategies studied here.

Each of the considered hallucination-detection strategies produces a score for every LLM response. For instance, the Probability strategy scores responses via the average log-probability of the generated tokens (equivalently, a monotonic transform of perplexity). The Self-Eval strategy asks the LLM to rate its confidence in its own answer on a 1-5 Likert scale, using chain-of-thought prompting.
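For illustration, here is roughly how these two baseline scores can be computed with the OpenAI Python client. This is only a sketch: the exact prompts and aggregation used in our benchmark may differ, and the function names here are ours.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def probability_score(prompt: str, model: str = "gpt-4o") -> tuple[str, float]:
    """Probability baseline: aggregate the token probabilities of the generated response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    answer = resp.choices[0].message.content
    logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    # Geometric-mean token probability (a monotonic transform of perplexity), in [0, 1].
    return answer, math.exp(sum(logprobs) / len(logprobs))

def self_eval_score(prompt: str, answer: str, model: str = "gpt-4o") -> str:
    """Self-Eval baseline: ask the same LLM to rate its own answer on a 1-5 scale."""
    eval_prompt = (
        f"Question: {prompt}\nProposed answer: {answer}\n"
        "Think step by step, then rate how confident you are that the answer is correct "
        "on a scale of 1 (likely wrong) to 5 (certainly correct). End with 'Rating: <1-5>'."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": eval_prompt}]
    )
    return resp.choices[0].message.content  # parse the trailing rating in practice
```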

Evaluating Performance

To quantify the effectiveness of each strategy using the ground-truth in our benchmark, we primarily consider: How many wrong LLM responses can we catch under a limited review budget by prioritizing via trustworthiness scores?

This is evaluated in two ways (with higher values indicating better performance), as sketched in code after this list:

  • LLM response accuracy over only the 80% of responses in each dataset that received the highest trustworthiness scores (reported in the table above for GPT-4o).
  • Precision/recall for detecting incorrect LLM responses, measured using the Area under the Receiver Operating Characteristic Curve (AUROC).
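Both metrics can be computed from per-response correctness labels and trustworthiness scores, for example as follows (a sketch using scikit-learn; exact tie-handling in our benchmark may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_scores(is_correct: np.ndarray, trust_scores: np.ndarray, keep_frac: float = 0.8) -> dict:
    """is_correct: boolean array per LLM response; trust_scores: trustworthiness score per response."""
    # Metric 1: accuracy over the keep_frac of responses with the highest trustworthiness scores.
    cutoff = np.quantile(trust_scores, 1 - keep_frac)
    kept = trust_scores >= cutoff
    accuracy_at_keep = is_correct[kept].mean()
    # Metric 2: AUROC for detecting incorrect responses (lower trust scores should flag errors).
    auroc = roc_auc_score(~is_correct, -trust_scores)
    return {"accuracy_top_80pct": accuracy_at_keep, "auroc": auroc}
```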

Datasets

Our study considers popular Q&A datasets:

  • TriviaQA: Open-domain trivia questions.
  • ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
  • SVAMP: Elementary-level math word problems.
  • GSM8k: Grade school math problems.
  • Diagnosis: Classifying medical conditions based on symptom descriptions from patients.

Additional results for GPT-4o

The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o:

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 0.64        | 0.574     | 0.817 |
| ARC       | 0.815       | 0.686     | 0.850 |
| SVAMP     | 0.612       | 0.589     | 0.788 |
| GSM8k     | 0.424       | 0.504     | 0.659 |
| Diagnosis | 0.73        | 0.596     | 0.722 |

Benchmark results for GPT-4o mini

To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different base LLM model. In this section, we use OpenAI’s cheaper/faster GPT-4o mini LLM instead of GPT-4o. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o mini:

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 0.715       | 0.678     | 0.809 |
| ARC       | 0.754       | 0.719     | 0.867 |
| SVAMP     | 0.863       | 0.838     | 0.933 |
| GSM8k     | 0.729       | 0.886     | 0.913 |
| Diagnosis | 0.668       | 0.618     | 0.697 |

To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-4o mini):

| Dataset   | Probability | Self-Eval | TLM   |
|-----------|-------------|-----------|-------|
| TriviaQA  | 84%         | 81.5%     | 87.5% |
| ARC       | 95.7%       | 95.4%     | 97.7% |
| SVAMP     | 93.3%       | 96.4%     | 97.4% |
| GSM8k     | 79.6%       | 84.2%     | 82.1% |
| Diagnosis | 67.7%       | 69%       | 73.1% |

Benchmark results for Claude 3 Haiku

We also repeat our study with another base LLM model: Anthropic’s Claude 3 Haiku. Here we do not benchmark against the Probability strategy, as Anthropic does not provide token probabilities from their model. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with Claude 3 Haiku:

| Dataset   | Self-Eval | TLM   |
|-----------|-----------|-------|
| TriviaQA  | 0.551     | 0.775 |
| ARC       | 0.518     | 0.619 |
| SVAMP     | 0.512     | 0.855 |
| GSM8k     | 0.479     | 0.745 |
| Diagnosis | 0.538     | 0.679 |

To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on Claude 3 Haiku):

| Dataset   | Self-Eval | TLM   |
|-----------|-----------|-------|
| TriviaQA  | 76%       | 82.3% |
| ARC       | 85.2%     | 87.1% |
| SVAMP     | 93.1%     | 96.9% |
| GSM8k     | 86.8%     | 92.1% |
| Diagnosis | 58.1%     | 64%   |

Discussion

Across all datasets and LLM models, TLM trustworthiness scores consistently detect bad LLM responses with higher precision/recall than the Self-Eval or Probability scores. The latter strategies merely quantify limited forms of model uncertainty, whereas TLM is a universal uncertainty-quantification framework to catch all sorts of untrustworthy responses.

Use TLM to mitigate unchecked hallucination in any application – ideally with whichever base LLM model produces the best responses in your setting. Although current LLMs remain fundamentally unreliable, you now have a framework for delivering trustworthy AI!

Next Steps

  • Get started with the TLM API and run through various tutorials. Specify which base LLM model to use via the TLMOptions argument – all of the models listed in this article (and more) are supported out-of-the-box (see the sketch after this list).

  • Chat with TLM (free).

  • This article showcased the generality of TLM across various base LLM models that our public API provides out-of-the-box. If you’d like a (private) version of TLM based on your own custom LLM, get in touch!

  • Learn more about TLM, and refer to our original blogpost for additional results and benchmarking details.
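As a quick sketch of swapping the base LLM via the options argument (the client interface and model identifier strings here are assumptions; check the TLMOptions documentation for the exact supported names):

```python
from cleanlab_studio import Studio  # assumed public TLM client

studio = Studio("<YOUR_API_KEY>")

# Compare trustworthiness scoring across base LLMs; model names below are illustrative.
for model_name in ["gpt-4o", "gpt-4o-mini", "claude-3-haiku"]:
    tlm = studio.TLM(options={"model": model_name})  # TLMOptions-style dict
    out = tlm.prompt("Which planet has the most moons?")
    print(model_name, round(out["trustworthiness_score"], 3))
```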

Related Blogs

  • Automatically boost the accuracy of any LLM, without changing your prompts or the model: Demonstrating how the Trustworthy Language Model system can produce better responses from a wide variety of LLMs.
  • Prevent Hallucinated Responses from any AI Agent: A case study on a reliable Customer Support Agent built with LangGraph and automated trustworthiness scoring.