Automatically boost the accuracy of any LLM, without changing your prompts or the model

October 31, 2024
  • Hui Wen Goh
  • Jay Zhang
  • Ulyana Tkachenko
  • Jonas Mueller

The Trustworthy Language Model (TLM) scores the trustworthiness of responses from any LLM to catch hallucinations/errors in real-time. While other hallucination-detection methods exist, a true test of such capabilities is whether you can utilize them to automatically generate more accurate LLM responses.

[Figure: error-rate-reduction.png]

Built on top of any base LLM, the TLM system can boost response accuracy without introducing any additional model. This article benchmarks TLM systems built on different base models and finds that TLM reduces the error rate (the fraction of incorrect answers) of GPT-4o by up to 27%, of GPT-4o mini by up to 34%, of GPT-4 by up to 10%, of GPT-3.5 by up to 22%, of Claude 3 Haiku by up to 24%, of Claude 3.5 Sonnet by up to 20%, and of o1-preview by up to 20%. In each case, TLM’s improvement over the base model comes without altering prompts or relying on any LLM besides the base model.

| Dataset   | GPT-4o | TLM   |
|-----------|--------|-------|
| TriviaQA  | 88.2%  | 89.2% |
| ARC       | 96.6%  | 96.7% |
| SVAMP     | 95.0%  | 95.2% |
| GSM8k     | 74.1%  | 81.2% |
| Diagnosis | 68.9%  | 69.3% |

The table above reports the accuracy of responses from OpenAI’s GPT-4o vs. TLM utilizing GPT-4o as its base model. Here GPT-4o is the only LLM model utilized in producing all responses for each dataset. Details of each benchmark dataset are presented further below.
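The headline error-rate figures can be recovered from accuracy tables like the one above. Here is a minimal sketch, assuming "error rate" means 100% minus accuracy and that the reduction is measured relative to the base model's error rate. The article does not spell out the formula, and the function name is illustrative rather than part of TLM.

```python
# Accuracy (%) of GPT-4o vs. TLM (GPT-4o base), copied from the table above.
ACCURACY = {
    "TriviaQA": (88.2, 89.2),
    "ARC": (96.6, 96.7),
    "SVAMP": (95.0, 95.2),
    "GSM8k": (74.1, 81.2),
    "Diagnosis": (68.9, 69.3),
}


def relative_error_reduction(base_acc: float, tlm_acc: float) -> float:
    """Percent reduction in error rate, relative to the base model's error rate.

    Assumption: error rate = 100 - accuracy (both in percent).
    """
    base_err = 100.0 - base_acc
    tlm_err = 100.0 - tlm_acc
    return 100.0 * (base_err - tlm_err) / base_err


reductions = {
    name: relative_error_reduction(base, tlm) for name, (base, tlm) in ACCURACY.items()
}
best_dataset = max(reductions, key=reductions.get)

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% fewer errors")
print(f"Largest reduction: {best_dataset}")
```

Under this definition the largest reduction is on GSM8k, where the error rate falls from 25.9% to 18.8%. That is roughly 27%, in line with the "up to 27%" figure quoted for GPT-4o.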

Similarly, the table below reports the accuracy of responses from Claude 3.5 Sonnet vs. TLM utilizing Claude 3.5 Sonnet as its base model. Here Claude 3.5 Sonnet is the only LLM utilized in producing all responses for each dataset.

| Dataset   | Claude 3.5 Sonnet | TLM   |
|-----------|-------------------|-------|
| TriviaQA  | 81.5%             | 85.3% |
| ARC       | 94%               | 94.8% |
| SVAMP     | 95.6%             | 96.2% |
| GSM8k     | 95.1%             | 95.7% |
| Diagnosis | 67.6%             | 68.3% |

Benchmark results for Claude 3 Haiku

The table below reports the accuracy of responses from Anthropic’s Claude 3 Haiku model vs. TLM utilizing Claude 3 Haiku as its base model. Here Claude 3 Haiku is the only LLM model utilized in producing all responses for each dataset.

| Dataset   | Claude 3 Haiku | TLM   |
|-----------|----------------|-------|
| TriviaQA  | 75.3%          | 76.5% |
| ARC       | 84.7%          | 85.5% |
| SVAMP     | 93%            | 94.7% |
| GSM8k     | 87.3%          | 90.4% |
| Diagnosis | 56%            | 56.1% |

Benchmark results for GPT-4o mini

The table below reports the accuracy of responses from OpenAI’s GPT-4o mini model vs. TLM utilizing GPT-4o mini as its base model. Here GPT-4o mini is the only LLM utilized in producing all responses for each dataset.

| Dataset   | GPT-4o mini | TLM   |
|-----------|-------------|-------|
| TriviaQA  | 78%         | 79.5% |
| ARC       | 92.7%       | 93.4% |
| SVAMP     | 86.9%       | 88.7% |
| GSM8k     | 68%         | 79%   |
| Diagnosis | 62.4%       | 62.9% |

Benchmark results for GPT-4

The table below reports the accuracy of responses from OpenAI’s original GPT-4 model vs. TLM utilizing GPT-4 as its base model. Here GPT-4 is the only LLM utilized in producing all responses for each dataset.

| Dataset   | GPT-4 | TLM   |
|-----------|-------|-------|
| TriviaQA  | 84.7% | 84.8% |
| ARC       | 94.6% | 94.9% |
| SVAMP     | 90.7% | 91.7% |
| GSM8k     | 46.5% | 55.6% |
| Diagnosis | 67.4% | 68.0% |

Benchmark results for GPT-3.5

The table below reports the accuracy of responses from OpenAI’s GPT-3.5 model vs. TLM utilizing GPT-3.5 as its base model. Here GPT-3.5 is the only LLM utilized in producing all responses for each dataset.

| Dataset   | GPT-3.5 | TLM   |
|-----------|---------|-------|
| TriviaQA  | 73.0%   | 75.2% |
| ARC       | 82.2%   | 85.6% |
| SVAMP     | 79.6%   | 84.5% |
| GSM8k     | 68.8%   | 76.7% |
| Diagnosis | 58.3%   | 58.6% |

Results comparing the accuracy of OpenAI’s o1-preview model vs. TLM based on this model are additionally available here.

Across all datasets, we see the accuracy of TLM responses is consistently higher than the accuracy of the corresponding base LLM model. In each comparison of TLM vs. its corresponding base model, the TLM system solely utilizes this base LLM model – no other LLM model is involved in producing responses or scoring their trustworthiness. The prompts used for the base LLM model and TLM are identical in all cases.

Examples

Here are examples of benchmark questions incorrectly answered by GPT-4o, but correctly answered by TLM utilizing GPT-4o as its base model:

Question: What is a female gerbil called?

GPT-4o Response: A female gerbil is simply called a “female gerbil.” Unlike some other animals, gerbils do not have specific names for males and females.

TLM Response: A female gerbil is simply called a “doe.”

Question: Jean is two years older than Mark. Two years ago Mark was 5 years older than half Jan’s age. If Jan is 30 how old is Jean?

GPT-4o Response:

To solve this problem, we need to find the ages of Jean and Mark based on the given information.

1. We know that Jan is 30 years old.
2. Two years ago, Mark was 5 years older than half of Jan's age. Since Jan is 30, half of Jan's age is 15. Therefore, two years ago, Mark was \(15 + 5 = 20\) years old.
3. If two years ago Mark was 20, then currently Mark is \(20 + 2 = 22\) years old.
4. Jean is two years older than Mark. Therefore, Jean is \(22 + 2 = 24\) years old.

Thus, Jean is 24 years old.
Therefore, the answer (arabic numerals) is: 24

TLM Response:

Let's solve the problem step by step.

1. Let Jan's current age be 30.
2. Two years ago, Jan's age was \(30 - 2 = 28\).
3. Two years ago, Mark was 5 years older than half of Jan's age at that time:
   \[ \text{Half of Jan's age two years ago} = \frac{28}{2} = 14 \]
   \[ \text{Mark's age two years ago} = 14 + 5 = 19 \]
4. Now, we calculate Mark's current age:
   \[ \text{Mark's current age} = 19 + 2 = 21 \]
5. Jean is two years older than Mark:
   \[ \text{Jean's age} = 21 + 2 = 23 \]

Therefore, Jean is 23 years old.  
The answer (in arabic numerals) is: 23

Here are examples of benchmark questions incorrectly answered by Claude 3.5 Sonnet, but correctly answered by TLM utilizing Claude 3.5 Sonnet as its base model:

Question: Which element below is found in the smallest quantity in most living things? (A) iron (B) carbon (C) nitrogen (D) hydrogen Please restrict your answer to one letter from A to D and nothing else.

Claude 3.5 Sonnet Response: C

TLM Response: A

Question: Matthew gave equal numbers of crackers and cakes to his 4 friends. If he had 32 crackers and 98 cakes initially how many crackers did each person eat?

Claude 3.5 Sonnet Response:

To solve this problem, we need to:
1. Find the total number of items (crackers and cakes)
2. Divide the total by the number of people (Matthew + 4 friends = 5 people)
3. Compare this number to the initial number of crackers

Total items = 32 crackers + 98 cakes = 130 items
Items per person = 130 ÷ 5 = 26 items

Since Matthew gave equal numbers of crackers and cakes, each person got 13 crackers and 13 cakes.

The question asks specifically about crackers, so:

Answer: 13

TLM Response:

To solve this problem, let's follow these steps:

1. Matthew had 32 crackers initially.
2. He gave equal numbers of crackers to 4 friends.
3. We need to find out how many crackers each person ate.

Since Matthew distributed the crackers equally among 4 friends, we simply need to divide the total number of crackers by 4.

32 crackers ÷ 4 friends = 8 crackers per friend

Therefore, each person ate 8 crackers.

Answer: 8

Benchmark Details

Here’s the code used to run TLM in our benchmark, in this case using GPT-4o as the base LLM:

# Assumed import: the original snippet omits it. This presumes the cleanlab-tlm
# Python package with an API key already configured in your environment.
from cleanlab_tlm import TLM

# quality_preset="best" turns on response improvement; "model" selects the base LLM.
tlm = TLM(quality_preset="best", options={"model": "gpt-4o"})
output = tlm.prompt("As of 2024, how many dogs have reached outer space?")

print(output)
# {
#     "response": "As of 2024, a total of 12 dogs have been sent into space as part of the Soviet space program during the 1950s and 1960s. The most famous of these dogs is Laika, who became the first animal to orbit the Earth aboard Sputnik 2 in 1957. Other dogs, such as Belka and Strelka, also made significant contributions to space exploration. The missions involving these dogs were crucial for understanding the effects of space travel on living organisms.",
#     "trustworthiness_score": 0.598
# }

Our benchmark considers popular Q&A datasets:

  • TriviaQA: Open-domain trivia questions.
  • ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
  • SVAMP: Elementary-level math word problems.
  • GSM8k: Grade school math problems.
  • Diagnosis: Classifying medical conditions based on symptom descriptions from patients.

Benchmarking challenges include the open-domain nature of many questions and errors in some of the listed ground-truth answers. Our benchmark therefore uses subsets of certain datasets, filtering out examples whose ground-truth answer is wrong or could not be verified. Additional benchmarking details are listed here.

Discussion

Many folks know that TLM offers state-of-the-art real-time hallucination detection for responses from any LLM. However, fewer folks are aware that TLM can be used to produce more accurate responses than any LLM – all without relying on another model. Simply set quality_preset = "best" in your TLM code, and the system will automatically improve answers from its base LLM.

Internally, the TLM system works like this to improve responses: sample multiple candidate responses from the base LLM, score their trustworthiness, and then return the most trustworthy response. This simple approach can only be effective if the trustworthiness scores reliably separate correct vs. incorrect LLM responses. Unlike other hallucination detection models, TLM can actually improve arbitrary LLM responses in this fashion – strong evidence for the generalizability of TLM’s hallucination detection capabilities.
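The sample, score, and select loop described above can be sketched as follows. This is an illustrative approximation, not TLM's actual implementation. `sample_response` and `score_trustworthiness` are hypothetical callables standing in for the base LLM and the trustworthiness scorer, and the number of candidates is an assumed parameter that the article does not specify.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    sample_response: Callable[[str], str],               # hypothetical: one sampled answer from the base LLM
    score_trustworthiness: Callable[[str, str], float],  # hypothetical: score for a (prompt, response) pair
    num_candidates: int = 4,                              # assumed value; not stated in the article
) -> Tuple[str, float]:
    """Sample several candidate responses and return the most trustworthy one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [sample_response(prompt) for _ in range(num_candidates)]
    scored = [(score_trustworthiness(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score


# Toy stand-ins so the sketch runs without API access; the scores are made up.
_canned = iter(["24", "23", "24", "23"])
_toy_scores = {"24": 0.41, "23": 0.87}

response, score = best_of_n(
    "If Jan is 30 how old is Jean?",
    sample_response=lambda _prompt: next(_canned),
    score_trustworthiness=lambda _prompt, resp: _toy_scores[resp],
)
print(response, score)
```

The selection step only helps when higher scores track correctness. If the scorer ranked the wrong candidate highest, this loop would return the wrong answer just as confidently.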

Of course, there’s no free lunch. The improved accuracy of TLM responses requires slightly longer runtimes than the base LLM. TLM with the accuracy-improving "best" quality_preset is thus best suited for applications like data processing or LLM automation of your team’s manual work. For latency-sensitive chat applications, we recommend a lower TLM quality_preset (like "medium" or "base"), which enables you to flag incorrect/hallucinated responses in real time.
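For the latency-sensitive case, flagging can be as simple as thresholding the trustworthiness score. Below is a minimal sketch that operates on an output dictionary shaped like the one shown in Benchmark Details. The helper name and the 0.7 threshold are assumptions for illustration, not recommended values.

```python
from typing import Any, Dict


def needs_review(output: Dict[str, Any], threshold: float = 0.7) -> bool:
    """Return True when the trustworthiness score falls below the threshold.

    The default threshold is a hypothetical choice; tune it for your application.
    """
    return output["trustworthiness_score"] < threshold


example_output = {
    # Response text abbreviated from the example in Benchmark Details.
    "response": "As of 2024, a total of 12 dogs have been sent into space...",
    "trustworthiness_score": 0.598,
}

if needs_review(example_output):
    print("Low trustworthiness: show a warning or route to a human reviewer.")
```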

Next Steps

  • Get started with the TLM API and run through various tutorials. Specify which base LLM to use via the TLMOptions argument – all of the models listed in this article (and more) are supported out-of-the-box. Set quality_preset = "best" to maximally boost response accuracy.

  • Demo TLM through our interactive playground.

  • Refer to our original blogpost to learn more about TLM and see additional benchmarks.

  • This article showcased the generality of TLM across various base LLM models that our public API provides out-of-the-box. If you’d like a (private) version of TLM based on your own custom LLM, get in touch!
