The Trustworthy Language Model (TLM) is a system for reliable AI that adds a trustworthiness score to every LLM response. Compatible with any base LLM model, TLM now comes with out-of-the-box support for new models from OpenAI and Anthropic, including GPT-4o, GPT-4o mini, and Claude 3 Haiku. This article shares comprehensive benchmarks comparing the hallucination detection performance of TLM against other hallucination-scoring strategies using these same LLM models:
- The Self-Eval strategy asks the LLM to rate its own output in an additional request.
- The Probability strategy aggregates the token probabilities output by the model (token probabilities are not available from every LLM provider, for instance Anthropic or AWS Bedrock).
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 89.8% | 89.0% | 94.8% |
ARC | 98.7% | 97.8% | 99.2% |
SVAMP | 96.3% | 95.8% | 97.7% |
GSM8k | 72.8% | 74.5% | 77.0% |
Diagnosis | 74.8% | 73.6% | 75.8% |
The table above reports results obtained using OpenAI’s GPT-4o model with trustworthiness scores computed via one of three strategies: TLM, Self-Eval, or Probability. Each row lists the accuracy of LLM responses over the 80% of examples in each dataset that received the highest trustworthiness scores.
One framework for reliable AI is to have the system abstain from responding when estimated trustworthiness is too low, particularly in human-in-the-loop workflows where we only want LLMs to automate the subset of tasks that they can reliably handle. The results above show that trustworthiness scores from TLM yield significantly more reliable AI in such applications. For instance, TLM enables your team to ensure a < 1% error rate over the ARC dataset while reviewing < 20% of the data, savings that are not achievable via the other scoring techniques.
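To make this concrete, here is a minimal sketch of such an abstain-or-escalate workflow built on the TLM Python client. The client names (`Studio`, `TLM`, the `options` dictionary) follow the TLM tutorials, but treat the exact interface, the chosen base model, and the 0.8 threshold as illustrative assumptions rather than a definitive implementation:

```python
# Minimal sketch of an abstain-or-escalate workflow built on TLM.
# Client usage follows the TLM tutorials; the exact interface, the chosen base
# model, and the 0.8 threshold are illustrative assumptions, not prescriptions.
from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")
tlm = studio.TLM(options={"model": "gpt-4o"})  # any supported base LLM model

TRUST_THRESHOLD = 0.8  # tune per application: higher threshold = fewer errors, more review

def answer_or_escalate(prompt: str) -> dict:
    """Auto-respond only when TLM deems the response trustworthy; otherwise escalate."""
    result = tlm.prompt(prompt)  # returns the LLM response plus its trustworthiness score
    if result["trustworthiness_score"] >= TRUST_THRESHOLD:
        return {"answer": result["response"], "needs_human_review": False}
    # Abstain: route the example (and the draft response) to a human reviewer.
    return {"answer": None, "needs_human_review": True, "draft": result["response"]}

print(answer_or_escalate("On a standard dartboard, which number lies opposite number 4?"))
```

In a deployment like this, the escalated examples correspond to the low-score subset that your team reviews manually, while the rest are answered automatically.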
Below, we present additional benchmarks while keeping this update concise. To learn more about TLM and these benchmarks, refer to our original blogpost, which provides full details and many additional results.
Hallucination Detection Benchmark
All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. This is unlike other LLM benchmarks that rely on noisy LLM-based evaluations. When presenting results for each specific base LLM model, only that model is used to produce responses and evaluate their trustworthiness – no other LLM is involved. All prompting and usage of the base LLM remains identical across all strategies studied here.
Each of the considered hallucination detection strategies produces a score for every LLM response. For instance, the Probability strategy scores responses via their perplexity (the exponentiated negative mean of the token log probabilities), so responses whose tokens the model assigned higher probability receive higher scores. The Self-Eval strategy asks the LLM to rate its confidence on a 1-5 Likert scale using Chain-of-Thought prompting.
Prompt used for the Self-Eval strategy:
Question: {question}
Answer: {LLM response}
Evaluate how confident you are that the given Answer is a good and accurate response to the Question.
Please assign a Score using the following 5-point scale:
1: You are not confident that the Answer addresses the Question at all, the Answer may be entirely off-topic or irrelevant to the Question.
2: You have low confidence that the Answer addresses the Question, there are doubts and uncertainties about the accuracy of the Answer.
3: You have moderate confidence that the Answer addresses the Question, the Answer seems reasonably accurate and on-topic, but with room for improvement.
4: You have high confidence that the Answer addresses the Question, the Answer provides accurate information that addresses most of the Question.
5: You are extremely confident that the Answer addresses the Question, the Answer is highly accurate, relevant, and effectively addresses the Question in its entirety.
The output should strictly use the following template: Explanation: [provide a brief reasoning you used to derive the rating Score] and then write ‘Score: <rating>’ on the last line.
We also tried other prompt variants of the Self-Eval method, as well as having the LLM report confidence in its original answer on finer-grained numeric scales (e.g., 1-10 or 1-100), but the resulting hallucination scores performed worse.
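For reference, both baseline strategies are simple to implement on top of any LLM API that exposes the needed outputs. The sketch below assumes OpenAI's Python client: it derives a Probability-style score from the returned token log probabilities and a Self-Eval-style score by sending a rating request and parsing the reported 1-5 score. The condensed template in the code is only a placeholder; the benchmarks use the full 5-point rubric shown above.

```python
# Sketch of the two baseline scoring strategies, here assuming OpenAI's Python client.
# Names, the model choice, and the condensed Self-Eval template are illustrative.
import math
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def probability_score(question: str, model: str = "gpt-4o") -> tuple[str, float]:
    """Score a response by the exponentiated mean token log-probability (1 / perplexity)."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = completion.choices[0]
    token_logprobs = [tok.logprob for tok in choice.logprobs.content]
    score = math.exp(sum(token_logprobs) / len(token_logprobs))  # in (0, 1]; higher = more confident
    return choice.message.content, score

# Condensed placeholder for the full Self-Eval prompt (5-point rubric) shown above.
SELF_EVAL_TEMPLATE = (
    "Question: {question}\nAnswer: {answer}\n"
    "Evaluate how confident you are that the given Answer is a good and accurate "
    "response to the Question. Assign a Score from 1 (not confident) to 5 (extremely "
    "confident), explain your reasoning, then write 'Score: <rating>' on the last line."
)

def self_eval_score(question: str, answer: str, model: str = "gpt-4o") -> float:
    """Ask the same LLM to rate its own answer on a 1-5 scale, mapped to [0, 1]."""
    rating_request = SELF_EVAL_TEMPLATE.format(question=question, answer=answer)
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": rating_request}],
    )
    match = re.search(r"Score:\s*([1-5])", completion.choices[0].message.content)
    return (int(match.group(1)) - 1) / 4 if match else 0.0  # unparseable rating -> lowest score
```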
Evaluating Performance
To quantify the effectiveness of each strategy using the ground-truth in our benchmark, we primarily consider: How many wrong LLM responses can we catch under a limited review budget by prioritizing via trustworthiness scores?
This is evaluated in two ways, with higher values indicating better performance (a short code sketch of both computations follows this list):
- LLM response accuracy over only the 80% of responses in each dataset that received the highest trustworthiness scores (reported in the table above for GPT-4o).
- How well the scores detect incorrect LLM responses, measured using the Area Under the Receiver Operating Characteristic curve (AUROC).
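Given a list of trustworthiness scores and ground-truth correctness labels, both metrics can be computed in a few lines. A minimal sketch using scikit-learn, with toy values standing in for real benchmark outputs:

```python
# Sketch of the two evaluation metrics, given trustworthiness scores and ground-truth
# correctness labels for each LLM response (toy values stand in for real benchmark data).
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.99, 0.12, 0.85, 0.38, 0.97, 0.22])  # higher = more trustworthy
correct = np.array([1, 0, 1, 0, 1, 0])                   # 1 = response matches ground truth

# Metric 1: accuracy over the 80% of responses with the highest trustworthiness scores.
keep = scores >= np.quantile(scores, 0.20)  # set aside the lowest-scored ~20% for human review
accuracy_top_80 = correct[keep].mean()

# Metric 2: AUROC for detecting incorrect responses (incorrect = positive class,
# flagged by a low trustworthiness score, hence the negated scores).
auroc = roc_auc_score(1 - correct, -scores)

print(f"Accuracy over top-80% most-trusted responses: {accuracy_top_80:.3f}")
print(f"AUROC for detecting incorrect responses: {auroc:.3f}")
```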
Datasets
Our study considers popular Q&A datasets:
- TriviaQA: Open-domain trivia questions.
- ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
- SVAMP: Elementary-level math word problems.
- GSM8k: Grade school math problems.
- Diagnosis: Classifying medical conditions based on symptom descriptions from patients.
Below are examples from the benchmarks where LLM responses were correct or wrong, along with the TLM trustworthiness score assigned to each.
Examples from benchmark where LLM responses are correct
Prompt: If 6 potatoes makes 36 hash browns, how many hash browns can you make out of 96 potatoes?
LLM Response: 576
TLM Trustworthiness Score: 0.993
Prompt:
You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them.
The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a feeling of food or acid backing up into my throat. I have chest pain which gets worse if I lie down. I get frequent heartburn or indigestion, after eating food and vomit it out.
LLM Response: gastroesophageal reflux disease
TLM Trustworthiness Score: 0.994
Examples from benchmark where LLM responses are wrong
Prompt: Emil is 19 years old now. When he turns 24, he will be half the age of his dad but twice as old as his brother. What is the sum of the ages of his dad and his brother now?
LLM Response: 65
TLM Trustworthiness Score: 0.123
(Ground-Truth Answer: 50)
Prompt: On a standard dartboard, which number lies opposite number 4?
LLM Response: 18
TLM Trustworthiness Score: 0.379
(Ground-Truth Answer: 16)
Prompt:
You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them.
The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a severe headache that feels like pressure in my head. I also have a mild fever and small red spots on my back.
LLM Response: migraine
TLM Trustworthiness Score: 0.221
(Ground-Truth Answer: dengue)
Additional results for GPT-4o
The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.64 | 0.574 | 0.817 |
ARC | 0.815 | 0.686 | 0.850 |
SVAMP | 0.612 | 0.589 | 0.788 |
GSM8k | 0.424 | 0.504 | 0.659 |
Diagnosis | 0.73 | 0.596 | 0.722 |
Benchmark results for GPT-4o mini
To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different base LLM model. In this section, we use OpenAI’s cheaper/faster GPT-4o mini LLM instead of GPT-4o. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with GPT-4o mini:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.715 | 0.678 | 0.809 |
ARC | 0.754 | 0.719 | 0.867 |
SVAMP | 0.863 | 0.838 | 0.933 |
GSM8k | 0.729 | 0.886 | 0.913 |
Diagnosis | 0.668 | 0.618 | 0.697 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-4o mini):
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 84% | 81.5% | 87.5% |
ARC | 95.7% | 95.4% | 97.7% |
SVAMP | 93.3% | 96.4% | 97.4% |
GSM8k | 79.6% | 84.2% | 82.1% |
Diagnosis | 67.7% | 69% | 73.1% |
Benchmark results for Claude 3 Haiku
We also repeat our study with another base LLM model: Anthropic’s Claude 3 Haiku. Here we do not benchmark against the Probability strategy, as Anthropic does not provide token probabilities from their model. The following table reports the AUROC achieved by each trustworthiness scoring strategy when applied with Claude 3 Haiku:
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 0.551 | 0.775 |
ARC | 0.518 | 0.619 |
SVAMP | 0.512 | 0.855 |
GSM8k | 0.479 | 0.745 |
Diagnosis | 0.538 | 0.679 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on Claude 3 Haiku):
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 76% | 82.3% |
ARC | 85.2% | 87.1% |
SVAMP | 93.1% | 96.9% |
GSM8k | 86.8% | 92.1% |
Diagnosis | 58.1% | 64% |
Discussion
Across nearly all datasets and LLM models, TLM trustworthiness scores detect bad LLM responses with higher precision/recall than the Self-Eval or Probability scores. The latter strategies quantify only limited forms of model uncertainty, whereas TLM is a universal uncertainty-quantification framework designed to catch many kinds of untrustworthy responses.
Use TLM to mitigate unchecked hallucination in any application – ideally with whichever base LLM model produces the best responses in your setting. Although current LLMs remain fundamentally unreliable, you now have a framework for delivering trustworthy AI!
Next Steps
- Get started with the TLM API and run through various tutorials. Specify which base LLM model to use via the TLMOptions argument – all of the models listed in this article (and more) are supported out-of-the-box.
- Demo TLM through our interactive playground.
- This article showcased the generality of TLM across various base LLM models that our public API provides out-of-the-box. If you’d like a (private) version of TLM based on your own custom LLM, get in touch!
- Learn more about TLM, and refer to our original blogpost for additional results and benchmarking details.