Cleanlab’s new Trustworthy Language Model (TLM) overcomes the biggest barrier to enterprise adoption of LLMs: hallucinations and reliability. By adding a trust score to every LLM response, TLM helps you automatically catch bogus LLM outputs. This enables you to deploy generative AI for new use cases previously unsuitable for LLMs. Rigorous benchmarking shows that: TLM has better-calibrated trustworthiness scores (enabling greater cost/time savings) than existing approaches to detect LLM errors, and TLM can utilize these trustworthiness scores to produce more accurate responses than existing LLMs.
Try out the TLM API for free, or play with TLM via our interactive demo.
LLMs’ biggest challenge: hallucinations
A recent Gartner poll shows that while 55% of organizations are experimenting with generative AI, only 10% have put generative AI into production. A major barrier to productionizing LLMs is their occasional tendency to produce bogus outputs known as hallucinations, which precludes their use in applications where correct outputs are necessary (i.e., most applications)!
Despite their brittle nature, organizations have deployed LLMs, sometimes with catastrophic results. Air Canada’s chatbot hallucinated refund policies, resulting in the airline being held responsible for the misinformation and monetary penalties; the chatbot has since been taken down. A federal judge fined a law firm after their lawyers used ChatGPT to draft a brief full of fabricated citations. New York City’s “MyCity” chatbot has been hallucinating wrong answers to business owners’ questions about local laws.
Overcoming hallucinations with trustworthiness scores
LLMs will always exhibit occasional hallucinations and incorrect responses, but by providing a trustworthiness score with every output, Cleanlab TLM lets you identify when the LLM is hallucinating. The TLM API can serve as:
- A drop-in replacement for your LLM. Like existing LLM APIs, TLM provides a `.prompt()` method that returns a response along with a trustworthiness score, enabling more reliable AI deployments.
  - Even the responses themselves are more accurate than the baseline LLM when using certain TLM quality presets, which internally produce many responses and output the one with the highest trustworthiness score.
- A layer of trust for your existing LLM outputs or human-generated data. TLM provides a `.get_trustworthiness_score()` method that can score any prompt/response pair to detect bad/wrong responses in real-time.
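As a rough illustration of these two modes, here is a minimal sketch. `StubTLM` is a hypothetical offline stand-in written only for this example, and its canned answers and scores are invented; only the method names `.prompt()` and `.get_trustworthiness_score()` and the `response` / `trustworthiness_score` keys come from this article. The return type of the scoring method (a plain float) is an assumption.

```python
from typing import Dict, Union


class StubTLM:
    """Hypothetical offline stand-in for a TLM client (not the real client)."""

    def prompt(self, prompt: str) -> Dict[str, Union[str, float]]:
        # A real client would call the TLM service; this returns a canned answer.
        return {"response": "Paris", "trustworthiness_score": 0.98}

    def get_trustworthiness_score(self, prompt: str, response: str) -> float:
        # Canned scores: high for the expected answer, low for anything else.
        return 0.98 if response.strip().lower() == "paris" else 0.05


tlm = StubTLM()
question = "What is the capital of France?"

# Mode 1: drop-in replacement that returns a response plus a score.
out = tlm.prompt(question)
print(out["response"], out["trustworthiness_score"])

# Mode 2: score a response that was produced elsewhere (another LLM or a human).
score = tlm.get_trustworthiness_score(question, "Lyon")
print(score)
```

Swapping the stub for a real client should leave the calling code unchanged, provided the client exposes the same two methods.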
TLM works by augmenting existing LLMs with a layer of trust. The generally-available version of TLM lets you choose between a number of popular base models, including GPT-4o, GPT-4o mini, GPT-4, GPT-3.5, o1-preview, Claude 3 and 3.5 Sonnet, but TLM can augment any LLM. For enterprise use cases, such as adding trustworthiness to your custom fine-tuned LLM, contact us.
Berkeley Research Group (BRG) has already seen significant cost savings from leveraging TLM. According to Steven Gawthorpe, PhD, Associate Director and Senior Data Scientist at BRG:
While there are always other tools out there, Cleanlab’s TLM is the first viable answer to LLM hallucinations that I’ve seen. Several of our human-in-the-loop LLM workflows can now be 80% automated with Cleanlab’s trustworthiness scores on every LLM output. Doing this manually for the entire dataset is often impossible, but Cleanlab gives us the power of 1000s of data scientists to enrich data and strengthen LLM outputs. The downstream cost savings of using TLM for accurate data are substantial, providing significant financial benefits with 10x to 100x ROI for many of our clients. Other tools on the market aren’t even on the same playing field compared to what Cleanlab is doing.
Use cases enabled by TLM
Trustworthiness scores unlock new production use cases of LLMs, and any existing application of LLMs can also benefit by taking into account these scores.
Customer service chatbot
TLM powers trustworthy chatbots that answer the 80% of questions where they are confident, but escalate to a human if they’re unsure about a response rather than hallucinating one (like in the Air Canada case). This is done simply by routing the question to a human when the trustworthiness score falls below a chosen threshold. If human escalation is not possible, untrustworthy responses can at least visually be flagged (as done in our playground).
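The routing rule described above can be sketched in a few lines. The `Routed` type, the destination names, and the default threshold of 0.8 are all hypothetical choices for illustration; a suitable threshold would need to be tuned on your own traffic.

```python
from dataclasses import dataclass


@dataclass
class Routed:
    destination: str  # "customer" or "human_agent" (hypothetical labels)
    message: str


def route(response: str, trustworthiness_score: float, threshold: float = 0.8) -> Routed:
    """Send trustworthy answers to the customer, escalate the rest to a human."""
    if trustworthiness_score >= threshold:
        return Routed("customer", response)
    return Routed("human_agent", "Escalated for review: " + response)


confident = route("Your order ships in 2 days.", 0.95)
unsure = route("Refunds are available for 90 days.", 0.41)
print(confident.destination, unsure.destination)
```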
Auto-labeling
LLMs are commonly used for auto-labeling data. With TLM, you can confidently auto-label a large fraction of your data and only have humans review a portion of the data where the LLM does not return trustworthy results.
from typing import Tuple

# Assumes `tlm` is an already-initialized TLM client.
template = '''
What type of compliance issue is most likely present in the following document?
Please restrict your answer to a one word answer and nothing else.
Your answer should be selected from the following options: HIPAA, FERPA, GDPR, none.
Document below here:
{document}
'''

def classify(document: str) -> Tuple[str, float]:
    """Return the predicted label and its trustworthiness score."""
    answer = tlm.prompt(template.format(document=document))
    return answer['response'], answer['trustworthiness_score']
Using this prompt to classify a large number of legal documents, we see that the documents with high trustworthiness scores were labeled correctly, while the documents with low scores often received erroneous labels that needed double-checking:
document | response | trustworthiness |
---|---|---|
All medical health records will be accessed one way only. The patient’s medical data will be stored on unencrypted public servers at the discretion of the enterprise customer. | HIPAA | 0.984 |
⋮ | ⋮ | ⋮ |
TechTarget’s Cookies Policy includes the following terminology: “By continuing to use the site, you agree to the use of cookies.” | FERPA | 0.426 |
For more on this use case, see the TLM auto-labeling tutorial.
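To make the review workflow concrete, the sketch below splits labeled documents into an auto-accepted set and a human-review queue. The function name `split_for_review` and the 0.9 threshold are assumptions for illustration; the two rows reuse the labels and scores from the table above, with the document text abbreviated.

```python
from typing import List, Tuple

Labeled = Tuple[str, str, float]  # (document, label, trustworthiness score)


def split_for_review(rows: List[Labeled], threshold: float = 0.9) -> Tuple[List[Labeled], List[Labeled]]:
    """Keep high-scoring labels automatically; queue the rest for human review."""
    auto, review = [], []
    for row in rows:
        (auto if row[2] >= threshold else review).append(row)
    return auto, review


rows = [
    ("Medical data stored on unencrypted public servers", "HIPAA", 0.984),
    ("TechTarget cookies policy", "FERPA", 0.426),
]
auto, review = split_for_review(rows)
print(len(auto), "auto-labeled;", len(review), "sent to review")
```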
Data extraction
TLM can also be used for open-domain data extraction. Our TLM information extraction tutorial walks through an example use case of extracting key information from electronics parts datasheet PDFs like the following:
If you were populating a parts catalog, you might be interested in extracting information like operating voltage from such documents, where TLM’s trustworthiness scores can automatically separate correctly extracted values from those that are wrong:
part | operating voltage | trustworthiness |
---|---|---|
ATtiny44A | 1.8 - 5.5V | 0.937 |
⋮ | ⋮ | ⋮ |
ZRE200GE | 1V - 15V DC | 0.567 |
… and more
The examples above just scratch the surface of reliable AI applications that become possible with TLM. We’re continually adding hands-on tutorials for new applications of TLM, such as:
- Trustworthy retrieval-augmented generation (RAG)
- Detecting bad instruction-tuning data
- Turning your own LLM into a TLM (Llama-3 example)
Explaining why a particular response is deemed untrustworthy
You can use TLM to not only catch hallucinations, but understand them better as well:
Evaluating TLM Performance
We evaluate TLM’s ability to add trust to arbitrary LLMs by benchmarking TLM against OpenAI’s GPT-4 LLM (and many other models in the Appendix). Our comprehensive benchmarks investigate two questions to evaluate the reliability of TLM’s (1) responses, and (2) trustworthiness scores:
- How accurate are TLM responses compared to the baseline LLM?
- To meet a required error rate by flagging low-scoring LLM responses for human review, how much cost/time does a team save by scoring responses via TLM vs. existing confidence estimation approaches?
The second item can be rephrased as: How many wrong LLM responses can we catch under a limited review budget by prioritizing via trustworthiness scores? When investigating this, we compare against two popular approaches to estimate the confidence of the baseline LLM:
- Self-Eval: Asking the LLM to evaluate its own output (e.g., rate its confidence on a scale of 1-5). This is a subsequent LLM-as-judge request to the model (details in Appendix).
- Probability: Relying on the probability of the generated output given by the language model, as recommended by OpenAI. This score is the average log probability of tokens in the LLM response, obtained from the raw output of the underlying autoregressive neural network. It is closely related to what AI research calls perplexity (the exponential of the negative average log probability).
Both of these confidence measures merely quantify the aleatoric uncertainty (known unknowns) in model predictions. This is uncertainty the model is aware of due to a known challenging prompt (e.g., incomplete/vague request). TLM’s trustworthiness score additionally quantifies epistemic uncertainty (unknown unknowns), which arises when the model was not previously trained on data similar to a given request.
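For readers who want to see the Probability baseline concretely, here is a minimal sketch that assumes you already have per-token log probabilities for a response. The function names and the example token probabilities are invented for illustration and are not taken from the benchmark code.

```python
import math
from typing import Sequence


def average_log_prob(token_logprobs: Sequence[float]) -> float:
    """Mean log probability of the tokens in one response."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return sum(token_logprobs) / len(token_logprobs)


def probability_score(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, a confidence value in (0, 1]."""
    return math.exp(average_log_prob(token_logprobs))


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative average log probability."""
    return math.exp(-average_log_prob(token_logprobs))


logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]  # made-up token probabilities
print(probability_score(logprobs), perplexity(logprobs))
```

Higher average log probability means lower perplexity, so ranking responses by either quantity gives the same ordering.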
Benchmark datasets
Our study focuses on Q&A settings. Unlike other LLM benchmarks, we never measure benchmark performance using LLM-based evaluations. All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. We consider these popular Q&A datasets:
- TriviaQA: Open-domain trivia questions.
- ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
- SVAMP: Elementary-level math word problems.
- GSM8k: Grade school math problems.
- Diagnosis: Diagnosing medical conditions based on symptom descriptions from the patient.
The next sections show some benchmark examples and the corresponding TLM outputs.
Examples from benchmark where TLM responded correctly
Prompt: If 6 potatoes makes 36 hash browns, how many hash browns can you make out of 96 potatoes?
TLM Output: 576 Trustworthiness Score: 0.993
Prompt:
You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them.
The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a feeling of food or acid backing up into my throat. I have chest pain which gets worse if I lie down. I get frequent heartburn or indigestion, after eating food and vomit it out.
TLM Output: gastroesophageal reflux disease Trustworthiness Score: 0.994
Examples from benchmark where TLM responded incorrectly
Prompt: Emil is 19 years old now. When he turns 24, he will be half the age of his dad but twice as old as his brother. What is the sum of the ages of his dad and his brother now?
TLM Output: 65 Trustworthiness Score: 0.123
(Ground-Truth Answer: 50)
Prompt: On a standard dartboard, which number lies opposite number 4?
TLM Output: 18 Trustworthiness Score: 0.379
(Ground-Truth Answer: 16)
Prompt:
You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them.
The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease].
Symptoms: I have a severe headache that feels like pressure in my head. I also have a mild fever and small red spots on my back.
TLM Output: migraine Trustworthiness Score: 0.221
(Ground-Truth Answer: dengue)
Benchmark Results
The following table reports the accuracy of responses from TLM and GPT-4 across each benchmark dataset:
Dataset | OpenAI GPT-4 API | Cleanlab TLM API |
---|---|---|
TriviaQA | 84.7% | 84.8% |
ARC | 94.6% | 94.9% |
SVAMP | 90.7% | 91.7% |
GSM8k | 46.5% | 55.6% |
Diagnosis | 67.4% | 68.0% |
Here TLM is using GPT-4 as a base model, and can consistently improve the accuracy of the baseline GPT-4 LLM across all datasets.
Next, we evaluate the three aforementioned approaches to estimate trustworthiness scores for each LLM response (again using GPT-4 as the baseline LLM): TLM, Self-Eval, Probability. The following plot reports the error rate of LLM responses amongst the top-K% of responses with the highest trustworthiness scores in each dataset:
Across all datasets, TLM trustworthiness scores allow us to more reliably detect bad LLM responses than the Self-Eval or Probability scores. If a team has to ensure a max-acceptable error rate by manually reviewing the low-scoring LLM responses, enormous reviewing costs/time can be saved by adopting TLM scores. For instance, a team could achieve near-zero error rates for the SVAMP dataset by only inspecting ~20% of the LLM responses when relying on TLM trustworthiness, but would have to inspect nearly 40% or 90% of the data when relying on Probability or Self-Eval scores, respectively.
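The two quantities discussed here (error rate among the top-K% highest-scoring responses, and the fraction of responses a team must review to reach a target error rate) can be sketched as follows. This is an illustrative implementation, not the benchmark code: the function names and toy data are made up, and it assumes every reviewed response ends up corrected by a human.

```python
from typing import Sequence


def error_rate_top_k(scores: Sequence[float], correct: Sequence[bool], k_percent: float) -> float:
    """Error rate among the k_percent of responses with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_keep = max(1, round(len(scores) * k_percent / 100))
    kept = order[:n_keep]
    return sum(1 for i in kept if not correct[i]) / n_keep


def review_fraction_needed(scores: Sequence[float], correct: Sequence[bool], max_error_rate: float) -> float:
    """Smallest fraction of lowest-scoring responses to review so that the
    unreviewed remainder has an error rate of at most max_error_rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # lowest score first
    n = len(scores)
    errors_left = sum(1 for c in correct if not c)
    for n_reviewed in range(n):
        if errors_left / (n - n_reviewed) <= max_error_rate:
            return n_reviewed / n
        if not correct[order[n_reviewed]]:
            errors_left -= 1  # this error is now caught by the reviewer
    return 1.0


# Toy data: ten responses with invented scores and correctness flags.
scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1]
correct = [True, True, True, True, True, False, True, False, False, False]
print(error_rate_top_k(scores, correct, 50))
print(review_fraction_needed(scores, correct, 0.0))
```

A better-calibrated score pushes the wrong responses toward the bottom of the ranking, which shrinks the review fraction returned by the second function.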
We additionally evaluate the utility of these trustworthiness scores via: the probability that LLM response #1 receives a higher trustworthiness score than LLM response #2, where the former is randomly selected from the subset of model responses that were correct, and the latter from the subset of model responses that were incorrect. Widely used to assess diagnostic scores, this evaluation metric is known as the Area under the Receiver Operating Characteristic Curve (AUROC). The following table reports the AUROC achieved by each trustworthiness scoring method in each dataset (again using GPT-4 as the baseline LLM):
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.704 | 0.623 | 0.812 |
ARC | 0.755 | 0.659 | 0.861 |
SVAMP | 0.943 | 0.793 | 0.973 |
GSM8k | 0.883 | 0.868 | 0.994 |
Diagnosis | 0.614 | 0.654 | 0.711 |
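The pairwise definition of AUROC given above translates directly into code. The sketch below is a straightforward quadratic-time version for illustration; counting ties as half a win is a common convention that the article does not specify, and the example scores are invented.

```python
from typing import Sequence


def auroc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a random correct response outscores a random incorrect one."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: ties count as half
    return wins / (len(pos) * len(neg))


example = auroc([0.9, 0.8, 0.3, 0.2], [True, False, True, False])
print(example)  # 3 of the 4 correct/incorrect pairs are ordered properly
```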
Additional benchmarks are presented in the Appendix, in particular with other versions of TLM built around GPT-4o mini, GPT-4o, GPT-3.5, o1-preview, Claude 3 Haiku, and Claude 3.5 Sonnet instead of GPT-4. The benchmarks reveal that TLM can reduce the error rate (the fraction of incorrect answers) of GPT-4 by up to 10%, of GPT-4o by up to 27%, of GPT-4o mini by up to 34%, of GPT-3.5 by up to 22%, of o1-preview by up to 20%, of Claude 3 Haiku by up to 24%, and of Claude 3.5 Sonnet by up to 20%. The trustworthiness estimates output by TLM are significantly more effective for catching bad answers, across different evaluation metrics, datasets, and LLMs.
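The article does not spell out how these error-rate reductions are computed. One plausible reading is a relative reduction in error rate, and under that assumption the GSM8k accuracies reported in the Appendix for GPT-4o mini (68% for the baseline, 79% for TLM) work out to roughly 34%. The helper name below is hypothetical.

```python
def relative_error_reduction(baseline_accuracy: float, tlm_accuracy: float) -> float:
    """Fraction of the baseline's errors that are eliminated (assumed definition)."""
    baseline_error = 1.0 - baseline_accuracy
    tlm_error = 1.0 - tlm_accuracy
    return (baseline_error - tlm_error) / baseline_error


# GSM8k with GPT-4o mini: error rate drops from 32% to 21%.
reduction = relative_error_reduction(0.68, 0.79)
print(f"{reduction:.1%}")
```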
Conclusion
This article shows how the TLM technology can boost the reliability of any LLM application. Use TLM trustworthiness scores to automatically catch bad outputs from any LLM in real-time. Additionally use TLM to produce more accurate responses than any base LLM model. You can use Cleanlab’s TLM built on top of popular base LLMs, or contact us to convert your own LLM into a TLM (requires no additional training of the LLM or access to its training data or model weights).
Of course, there’s no free lunch. TLM requires extra computation in order to provide these benefits. It internally calls the underlying LLM multiple times to self-reflect on candidate responses, compute probabilistic measures, and assess the semantic consistency between candidate responses. Learn more via the documentation. TLM is thus most useful for higher-stakes AI applications that require reliability and no unchecked hallucinations.
Resources
- Play with TLM via our interactive demo.
- Run the actual TLM API for free and try various tutorial use-cases.
- Read about TLM in today’s News.
Appendix
Expand each collapsible section below to learn more.
Additional GPT-4 benchmark results.
Here we supplement our AUROC evaluation of various trustworthiness scores’ utility with an additional evaluation metric. When we see a higher trustworthiness score, AUROC intuitively quantifies how much more confident we can truly be that the LLM answer is actually correct. Ideally, we’d like trustworthiness scores near 1 for LLM responses that are correct and near 0 for incorrect responses. However, AUROC does not quantify how different we can expect trustworthiness scores to look for correct vs. incorrect LLM answers.
The table below reports a measure of this separation via the Confidence Gap, defined as the difference between two averages. The first average is taken over the trustworthiness scores for LLM responses that were correct, the latter over the scores for incorrect responses.
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.0714 | 0.123 | 0.219 |
ARC | 0.0193 | 0.23 | 0.316 |
SVAMP | 0.208 | 0.566 | 0.633 |
GSM8k | 0.347 | 0.707 | 0.772 |
Diagnosis | 0.029 | 0.159 | 0.164 |
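As defined above, the Confidence Gap is simply a difference of two means. A minimal sketch follows; the function name and toy scores are made up for illustration.

```python
from statistics import mean
from typing import Sequence


def confidence_gap(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean score of correct responses minus mean score of incorrect responses."""
    right = [s for s, c in zip(scores, correct) if c]
    wrong = [s for s, c in zip(scores, correct) if not c]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect response")
    return mean(right) - mean(wrong)


gap = confidence_gap([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
print(gap)
```

A negative gap, as seen in a couple of the baseline entries in the Appendix tables, means the method scored incorrect responses higher on average than correct ones.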
Benchmark results for GPT-4o.
To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different baseline LLM model. In this section, we use OpenAI’s GPT-4o LLM instead of GPT-4. Our TLM implementation on top of GPT-4o solely relies on this baseline LLM and no other LLM model. See the earlier sections for definitions of each evaluation metric presented here.
The following table reports the accuracy of responses from TLM and GPT-4o across each benchmark dataset:
Dataset | GPT-4o | TLM |
---|---|---|
TriviaQA | 88.2% | 89.2% |
ARC | 96.6% | 96.7% |
SVAMP | 95.0% | 95.2% |
GSM8k | 74.1% | 81.2% |
Diagnosis | 68.9% | 69.3% |
The following table reports the AUROC achieved by each trustworthiness scoring method when applied with GPT-4o:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.64 | 0.574 | 0.817 |
ARC | 0.815 | 0.686 | 0.850 |
SVAMP | 0.612 | 0.589 | 0.788 |
GSM8k | 0.424 | 0.504 | 0.659 |
Diagnosis | 0.73 | 0.596 | 0.722 |
The following table reports the Confidence Gap achieved by each trustworthiness scoring method when applied with GPT-4o:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.023 | 0.122 | 0.222 |
ARC | 0.060 | 0.336 | 0.333 |
SVAMP | 0.019 | 0.120 | 0.076 |
GSM8k | -0.005 | 0.003 | 0.020 |
Diagnosis | 0.023 | 0.103 | 0.163 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-4o):
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 89.8% | 89.0% | 94.8% |
ARC | 98.7% | 97.8% | 99.2% |
SVAMP | 96.3% | 95.8% | 97.7% |
GSM8k | 72.8% | 74.5% | 77.0% |
Diagnosis | 74.8% | 73.6% | 75.8% |
Benchmark results for GPT-4o mini.
To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different baseline LLM model. In this section, we use OpenAI’s GPT-4o mini LLM instead of GPT-4. Our TLM implementation on top of GPT-4o mini solely relies on this baseline LLM and no other LLM model. See the earlier sections for definitions of each evaluation metric presented here.
The following table reports the accuracy of responses from TLM and GPT-4o mini across each benchmark dataset:
Dataset | GPT-4o mini | TLM |
---|---|---|
TriviaQA | 78% | 79.5% |
ARC | 92.7% | 93.4% |
SVAMP | 86.9% | 88.7% |
GSM8k | 68% | 79% |
Diagnosis | 62.4% | 62.9% |
The following table reports the AUROC achieved by each trustworthiness scoring method when applied with GPT-4o mini:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.715 | 0.678 | 0.809 |
ARC | 0.754 | 0.719 | 0.867 |
SVAMP | 0.863 | 0.838 | 0.933 |
GSM8k | 0.729 | 0.886 | 0.913 |
Diagnosis | 0.668 | 0.618 | 0.697 |
The following table reports the Confidence Gap achieved by each trustworthiness scoring method when applied with GPT-4o mini:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.044 | 0.176 | 0.236 |
ARC | 0.061 | 0.264 | 0.355 |
SVAMP | 0.076 | 0.49 | 0.327 |
GSM8k | 0.057 | 0.523 | 0.311 |
Diagnosis | 0.018 | 0.104 | 0.130 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-4o mini):
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 84% | 81.5% | 87.5% |
ARC | 95.7% | 95.4% | 97.7% |
SVAMP | 93.3% | 96.4% | 97.4% |
GSM8k | 79.6% | 84.2% | 82.1% |
Diagnosis | 67.7% | 69% | 73.1% |
Benchmark results for GPT-3.5.
To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different baseline LLM model. In this section, we use OpenAI’s GPT-3.5 LLM instead of GPT-4. Our TLM implementation on top of GPT-3.5 solely relies on this baseline LLM. Neither GPT-4 nor any other powerful LLM is used in any of the results presented in this section. See the earlier sections for definitions of each evaluation metric presented here.
The following table reports the accuracy of responses from TLM and GPT-3.5 across each benchmark dataset:
Dataset | GPT-3.5 | TLM |
---|---|---|
TriviaQA | 73.0% | 75.2% |
ARC | 82.2% | 85.6% |
SVAMP | 79.6% | 84.5% |
GSM8k | 68.8% | 76.7% |
Diagnosis | 58.3% | 58.6% |
The following table reports the AUROC achieved by each trustworthiness scoring method when applied with GPT-3.5:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.648 | 0.617 | 0.837 |
ARC | 0.739 | 0.604 | 0.902 |
SVAMP | 0.572 | 0.594 | 0.886 |
GSM8k | 0.726 | 0.559 | 0.773 |
Diagnosis | 0.611 | 0.570 | 0.733 |
The following table reports the Confidence Gap achieved by each trustworthiness scoring method when applied with GPT-3.5:
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 0.0587 | 0.171 | 0.272 |
ARC | 0.0314 | 0.0882 | 0.381 |
SVAMP | 0.0424 | 0.113 | 0.349 |
GSM8k | 0.0829 | 0.0794 | 0.219 |
Diagnosis | 0.025 | 0.071 | 0.124 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on GPT-3.5):
Dataset | Probability | Self-Eval | TLM |
---|---|---|---|
TriviaQA | 76.9% | 77.9% | 83.5% |
ARC | 86.1% | 84.1% | 94.8% |
SVAMP | 80.8% | 82.6% | 91.8% |
GSM8k | 75.9% | 72.3% | 78.3% |
Diagnosis | 60.8% | 62.3% | 64.8% |
Benchmark results for Claude 3 Haiku.
To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different baseline LLM model. In this section, we use Anthropic’s Claude 3 Haiku LLM instead of GPT-4. Our TLM implementation on top of Claude 3 Haiku solely relies on this baseline LLM and no other LLM model. See the earlier sections for definitions of each evaluation metric presented here. Note that we do not benchmark against the Probability method here, as Anthropic does not provide access to token probabilities from their model.
The following table reports the accuracy of responses from TLM and Claude 3 Haiku across each benchmark dataset:
Dataset | Claude 3 Haiku | TLM |
---|---|---|
TriviaQA | 75.3% | 76.5% |
ARC | 84.7% | 85.5% |
SVAMP | 93% | 94.7% |
GSM8k | 87.3% | 90.4% |
Diagnosis | 56% | 56.1% |
The following table reports the AUROC achieved by each trustworthiness scoring method when applied with Claude 3 Haiku:
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 0.551 | 0.775 |
ARC | 0.518 | 0.619 |
SVAMP | 0.512 | 0.855 |
GSM8k | 0.479 | 0.745 |
Diagnosis | 0.538 | 0.679 |
The following table reports the Confidence Gap achieved by each trustworthiness scoring method when applied with Claude 3 Haiku:
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 0.039 | 0.1 |
ARC | 0.009 | 0.041 |
SVAMP | 0.006 | 0.151 |
GSM8k | -0.01 | 0.104 |
Diagnosis | 0.029 | 0.135 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 80% over each dataset (all based on Claude 3 Haiku):
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 76% | 82.3% |
ARC | 85.2% | 87.1% |
SVAMP | 93.1% | 96.9% |
GSM8k | 86.8% | 92.1% |
Diagnosis | 58.1% | 64% |
Benchmark results for Claude 3.5 Sonnet.
To assess how effectively the TLM technology adds trust to arbitrary LLMs, we repeat the earlier benchmarks with a different baseline LLM model. In this section, we use Anthropic’s Claude 3.5 Sonnet LLM instead of GPT-4. Our TLM implementation on top of Claude 3.5 Sonnet solely relies on this baseline LLM and no other LLM model. See the earlier sections for definitions of each evaluation metric presented here. Note that we do not benchmark against the Probability method here, as Anthropic does not provide access to token probabilities from their model.
The following table reports the accuracy of responses from TLM and Claude 3.5 Sonnet across each benchmark dataset:
Dataset | Claude 3.5 Sonnet | TLM |
---|---|---|
TriviaQA | 81.5% | 85.3% |
ARC | 94% | 94.8% |
SVAMP | 95.6% | 96.2% |
GSM8k | 95.1% | 95.7% |
Diagnosis | 67.6% | 68.3% |
The following table reports the AUROC achieved by each trustworthiness scoring method when applied with Claude 3.5 Sonnet:
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 0.747 | 0.83 |
ARC | 0.7 | 0.894 |
SVAMP | 0.577 | 0.833 |
GSM8k | 0.56 | 0.659 |
Diagnosis | 0.713 | 0.713 |
The following table reports the Confidence Gap achieved by each trustworthiness scoring method when applied with Claude 3.5 Sonnet:
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 0.189 | 0.12 |
ARC | 0.286 | 0.554 |
SVAMP | 0.067 | 0.239 |
GSM8k | 0.075 | 0.131 |
Diagnosis | 0.197 | 0.18 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls in the top 70% over each dataset (all based on Claude 3.5 Sonnet):
Dataset | Self-Eval | TLM |
---|---|---|
TriviaQA | 89.8% | 92.4% |
ARC | 96.6% | 98.5% |
SVAMP | 96.3% | 98.7% |
GSM8k | 95.9% | 96.7% |
Diagnosis | 76.4% | 77.1% |
Benchmark results for o1.
Additional benchmarks with OpenAI’s o1-preview LLM are available here.
Additional benchmarking details.
Benchmarks of TLM accuracy were run using the `best` quality preset for the TLM, which samples multiple candidate responses and returns the one with the highest trustworthiness score. Benchmarks of the TLM trustworthiness score were run using default settings, which do not attempt to improve the response from the baseline LLM and merely score its trustworthiness. In all benchmarks, the TLM never accessed a more powerful LLM than the baseline model being compared against.
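The idea of sampling several candidates and keeping the most trustworthy one can be sketched generically. This is not TLM's internal implementation: `best_of_n`, the `generate` and `score` callables, and the default of four candidates are all hypothetical, and the stub scores below are invented. The candidate answers borrow from the Emil age question shown earlier, whose ground-truth answer is 50.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate responses and return the one with the highest score."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score


# Stubs standing in for an LLM sampler and a trustworthiness scorer.
canned = iter(["65", "50", "55", "50"])
generate = lambda prompt: next(canned)
score = lambda prompt, response: 0.9 if response == "50" else 0.2

answer, trust = best_of_n("What is the sum of the ages of his dad and his brother now?", generate, score)
print(answer, trust)
```

Because each candidate requires its own generation and scoring calls, this approach costs roughly n times as much as a single request.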
The prompt used for the Self-Eval method, sent as a separate request asking the LLM to evaluate its previous response, was:
Question: {question}
Answer: {LLM response}
Evaluate how confident you are that the given Answer is a good and accurate response to the Question.
Please assign a Score using the following 5-point scale:
1: You are not confident that the Answer addresses the Question at all, the Answer may be entirely off-topic or irrelevant to the Question.
2: You have low confidence that the Answer addresses the Question, there are doubts and uncertainties about the accuracy of the Answer.
3: You have moderate confidence that the Answer addresses the Question, the Answer seems reasonably accurate and on-topic, but with room for improvement.
4: You have high confidence that the Answer addresses the Question, the Answer provides accurate information that addresses most of the Question.
5: You are extremely confident that the Answer addresses the Question, the Answer is highly accurate, relevant, and effectively addresses the Question in its entirety.
The output should strictly use the following template: Explanation: [provide a brief reasoning you used to derive the rating Score] and then write ‘Score: <rating>’ on the last line.
For this Self-Eval method, we also tried having the LLM report confidence in its original answer on more continuous numeric scales (e.g., 1-10 or 1-100), but the resulting scores performed worse.
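To turn the Self-Eval output into a numeric confidence, the rating on the last line has to be parsed. The sketch below shows one way to do that; the function name and the linear mapping of the 1-5 rating onto [0, 1] are assumptions, since the article does not say how ratings were converted.

```python
import re
from typing import Optional

# Matches a final line such as "Score: 4" (optionally written as "Score: <4>").
_SCORE_RE = re.compile(r"Score:\s*<?([1-5])>?\s*$")


def parse_self_eval(output: str) -> Optional[float]:
    """Extract the 1-5 rating from the last non-empty line and map it to [0, 1]."""
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    match = _SCORE_RE.search(lines[-1])
    if match is None:
        return None  # the model did not follow the template
    return (int(match.group(1)) - 1) / 4


parsed = parse_self_eval("Explanation: The arithmetic checks out.\nScore: 4")
print(parsed)
```

Returning `None` for malformed outputs lets the caller decide whether to retry the request or treat the response as low confidence.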
Prompts used for each benchmark dataset.
Throughout all benchmarks, TLM and the baseline LLM are prompted using the exact same prompts. In competitive AI research, these types of benchmarks are run with sophisticated many-shot chain-of-thought prompts to maximize raw LLM accuracy (example). While such complex prompting helps foundation model providers show their new model is better than everybody else’s, it does not reflect the types of queries from typical users. Our benchmarks here use simple prompts to better reflect how LLMs are used to drive real-world business value. We are not focused on optimizing prompts to maximize LLM accuracy, and instead focus on studying the benefits of adding the TLM technology to any LLM.
The specific prompts we used to run our LLMs on each dataset are listed below.
TriviaQA:
{question text}
Therefore, the answer is
ARC: For GPT-3.5, GPT-4 and Claude 3 Haiku:
{multiple-choice question text}
Therefore, among A through D, the answer is:
For GPT-4o, GPT-4o mini and Claude 3.5 Sonnet (for which the above prompt produced overly lengthy answers):
{multiple-choice question text}
Please restrict your answer to one letter from A to D and nothing else.
SVAMP: For GPT-3.5, GPT-4, GPT-4o and Claude 3 Haiku:
{question text}
Therefore, the answer (arabic numerals) is:
For GPT-4o mini and Claude 3.5 Sonnet (for which the above prompt produced overly lengthy answers):
{question text}
Please strictly use the following template to provide your answer: Answer: [provide your numeric answer]
GSM8k: For GPT-3.5, GPT-4, GPT-4o and Claude 3 Haiku:
{question text}
Therefore, the answer (arabic numerals) is:
For GPT-4o mini and Claude 3.5 Sonnet (for which the above prompt produced overly lengthy answers):
{question text}
Please strictly use the following template to provide your answer: Answer: [provide your numeric answer]
Diagnosis:
You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them. The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease]. Consider why the Symptoms reflect a specific Diagnosis. In your response, respond with only a single Diagnosis out of the list. Do not write anything else.
Symptoms: {text from patient description}
Removing bad data from the benchmark datasets.
When studying initial benchmark results, we observed multiple examples where the TLM output had very high confidence but did not match the correct answer listed in the benchmark dataset. Upon closer inspection, most such examples actually had incorrect answers in the benchmark dataset, which can mislead AI research.
Thanks to TLM’s trustworthiness scores, we were able to catch these incorrect answers (each was manually verified) and remove them from our benchmark. The bad data we removed from the benchmarks is shared on Hugging Face.
Example error found in the GSM8K dataset:
Question: After scoring 14 points, Erin now has three times more points than Sara, who scored 8. How many points did Erin have before?
Answer According to the Dataset: 18
TLM Trustworthiness Score for this Answer: 0.000961
(Actual Answer we determined: 10)
Example error found in the SVAMP dataset:
Question: Rachel’s tree had 4 apples. She picked 2 apples from her tree. Thereafter 3 new apples grew on the tree. How many apples are there on the tree now?
Answer According to the Dataset: 1
TLM Trustworthiness Score for this Answer: 0.001508
(Actual Answer we determined: 5)
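A sketch of how such a screening pass might look: score each dataset answer against its question and flag the very low scores for manual verification. The function name, the 0.01 threshold, and the stub scorer are hypothetical; the two low scores are the ones reported in the examples above, while the high score for the potato question is invented.

```python
from typing import Callable, List, Tuple


def flag_suspect_answers(
    records: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.01,
) -> List[Tuple[str, str, float]]:
    """Return (question, answer, score) for dataset answers scoring below threshold,
    lowest score first, so they can be manually verified."""
    flagged = []
    for question, answer in records:
        s = score_fn(question, answer)
        if s < threshold:
            flagged.append((question, answer, s))
    return sorted(flagged, key=lambda row: row[2])


# Stub scorer keyed by the dataset answer; a real run would call a scoring method instead.
stub_scores = {"18": 0.000961, "1": 0.001508, "576": 0.99}
records = [
    ("Erin points question", "18"),
    ("Rachel apples question", "1"),
    ("Potato hash browns question", "576"),
]
suspects = flag_suspect_answers(records, lambda q, a: stub_scores[a])
print([answer for _, answer, _ in suspects])
```

Flagged items are only candidates for removal; as noted above, each one was manually verified before being dropped from the benchmark.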