Automated Hallucination Correction for AI Agents: A Case Study on Tau²-Bench

December 3, 2025
  • Tianyi Huang
  • Jonas Mueller

AI agents can fail in multi-turn, tool-use tasks when a single erroneous intermediate LLM output derails the agent. One way to detect such reasoning errors, hallucinations, or incorrect tool calls is real-time LLM trust scoring. Here we find that for the leading customer service AI agent benchmark, Tau²-Bench, LLM trust scoring with straightforward fallback strategies automatically cuts agent failure rates by up to 57% (code to reproduce).

Diagram of a trustworthiness scoring architecture showing how the different paths messages can take depending on their evaluation.
How one can automatically score the trustworthiness of each LLM message within an agent and, when the score is low, fall back to either: (i) escalating the customer interaction to a human support employee, or (ii) re-generating the LLM message to autonomously keep the agent on the rails.

AI agents are revolutionizing customer service, but remain too unreliable for complex tasks with high-stakes actions. The underlying LLMs occasionally produce incorrect outputs, and a single such error can irrecoverably derail an agent — a major risk given how many LLM calls each agent interaction requires.

To study agent reliability, we run Tau²-Bench. This leading customer service AI benchmark spans three domains (airline, retail, telecom) with complex customer interactions that require proper use of many tools, multi-step decision-making, and dealing with unpredictable customers. For example, in the telecom domain, the AI agent might be tasked with helping a customer fix their internet connection, where it must prompt the customer to check certain settings on their phone (such as airplane mode) and must use tools to look up information about the customer (such as their data plan). For a particular customer interaction to be deemed successful in Tau²-Bench, the agent must call the proper set of tools and respond to the customer with the required information (in Tau²-Bench terminology, we only consider pass^1 here).

Diagram of the original Tau²-Bench architecture showing user and customer-service agents exchanging messages and calling their respective tools.
Architecture of the original Tau²-Bench AI agent (a standard LLM tool calling loop).

Real-Time Trust Scoring for Agents

One way to deal with LLMs’ jagged intelligence and lack of reliability is real-time trustworthiness scoring of each LLM output in an agent’s message chain. Cleanlab’s Trustworthy Language Model (TLM) provides low-latency trust scores with state-of-the-art precision for detecting incorrect outputs from any LLM, regardless of whether the output is a natural language message or a tool call. TLM helps you automatically mitigate all sorts of LLM mistakes, including: reasoning errors, hallucinated facts, misunderstandings/oversights, system instruction violations, and wrong tool calls.
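For example, scoring a single LLM output looks roughly like the minimal sketch below. It assumes the `cleanlab_tlm` Python client and its `get_trustworthiness_score()` method; check the TLM docs for the exact interface and setup.

```python
from cleanlab_tlm import TLM  # pip install cleanlab-tlm

# Assumes a Cleanlab API key is configured (e.g., via environment variable).
tlm = TLM()

# `prompt` is whatever context the agent's LLM saw (system instructions,
# conversation history, tool schemas); `response` is the LLM output to vet,
# which may be a natural-language reply or a tool call serialized as text.
prompt = "System instructions + conversation history + tool definitions ..."
response = '{"tool": "get_order_details", "arguments": {"order_id": "W123"}}'

result = tlm.get_trustworthiness_score(prompt, response)
print(result["trustworthiness_score"])  # float in [0, 1]; lower means less trustworthy
```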

Consider the example below, where the Tau²-Bench agent inexplicably runs the same tool twice (LLMs make unpredictable mistakes like this). Because this shouldn’t happen, TLM gives the second tool call a low trustworthiness score.

Example of Trustworthiness Grading

This article considers two ways that trust scores can improve the original Tau²-Bench AI agent. Our approaches are easy to implement for any AI agent.

Escalating Untrustworthy Cases to Humans

For organizations seeking to provide high-quality customer support, interactions that the AI agent is likely to handle incorrectly should be escalated to a human support representative instead. One option is to equip the agent with a special escalate() tool or phrase, but LLMs remain poor at knowing when to ask for help or recognizing what they don’t know. TLM’s trustworthiness scores capture overall aleatoric and epistemic uncertainty in LLM outputs, enabling significantly better identification of cases where the agent makes a mistake.

Flowchart of the automated escalation pipeline where messages judged untrustworthy are routed to a human customer-service agent.
Automated Escalation - If an LLM message is deemed untrustworthy, the customer interaction is escalated to a human to avoid agent failure.

As depicted above, we can escalate customer interactions as soon as one of the agent’s LLM outputs falls below a trustworthiness score threshold. We assess the performance of this Automated Escalation pipeline via the agent’s failure rate among the remaining interactions (tasks) that were not escalated. The results below demonstrate that this approach can effectively reduce this failure rate across all Tau²-Bench domains, with lower trustworthiness thresholds providing greater failure rate reductions (as expected).
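In code, this guard amounts to a threshold check wrapped around each LLM step of the agent loop. The sketch below is illustrative only: `run_llm_step()` and `escalate_to_human()` are hypothetical stand-ins for your agent’s own internals (not Tau²-Bench or TLM APIs), and the threshold value is arbitrary.

```python
TRUST_THRESHOLD = 0.8  # illustrative; the results below compare two threshold values

def guarded_agent_step(tlm, prompt):
    """Run one LLM step of the agent, escalating the interaction if its trust score is too low."""
    response = run_llm_step(prompt)  # hypothetical: your agent's underlying LLM call
    score = tlm.get_trustworthiness_score(prompt, response)["trustworthiness_score"]
    if score < TRUST_THRESHOLD:
        # Hypothetical handoff: route the whole interaction to a human support rep.
        escalate_to_human(conversation=prompt, flagged_output=response)
        return None  # stop autonomous handling of this customer interaction
    return response
```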

Results of the automated escalation pipeline.
Results from running the agent with our automated escalation pipeline, where the customer service AI agent is powered by OpenAI's GPT-5 LLM. In this pipeline, customer interactions are escalated to human support when one of the agent's LLM messages receives a trust score below a threshold (results for two threshold values are plotted separately).

Autonomously Revising Untrustworthy LLM Messages

Curbing agent failures via escalation (or similar abstention fallbacks) can ensure that your agent meets enterprise failure-rate requirements, but in those escalated cases the agent itself did not manage to help the customer. Let’s now consider how to automatically boost the agent’s success rate.

Flowchart of the message-revision pipeline where the agent rewrites messages deemed untrustworthy to make them more reliable.
Automated Message Revision - If an LLM message is deemed untrustworthy, the agent autonomously re-generates this message to make it more trustworthy.

Here’s how our Automated Message Revision pipeline depicted above works. As before, we score the trustworthiness of each of the agent’s LLM messages in real time, before any tools are called or the message is presented to the customer. If the trustworthiness score falls below a threshold, we automatically re-generate a replacement LLM message before the agent continues executing (including any tool calls). TLM also provides explanations of why a given LLM output is untrustworthy. To re-generate a better message, we reuse the input to the original LLM call, but first append an extra statement that reports the original LLM output, cautions that it was flagged as untrustworthy, and includes the explanation why. If the newly generated LLM message receives a higher trustworthiness score than the original, it replaces the original; otherwise execution continues with the original LLM message.
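The sketch below outlines this revision step under the same assumptions as the earlier sketches: `run_llm_step()` is a hypothetical stand-in for the agent’s own LLM call, and the field used to retrieve TLM’s explanation is an assumption to verify against the TLM docs.

```python
REVISION_THRESHOLD = 0.8  # illustrative value

def revise_if_untrustworthy(tlm, prompt, original_message):
    """Score the agent's draft message; if untrustworthy, re-generate it and keep whichever scores higher."""
    result = tlm.get_trustworthiness_score(prompt, original_message)
    original_score = result["trustworthiness_score"]
    if original_score >= REVISION_THRESHOLD:
        return original_message

    # Assumed field for TLM's explanation of why the output was flagged
    # (requires explanations to be enabled in the TLM options; see its docs).
    explanation = result.get("log", {}).get("explanation", "")

    caution = (
        "\n\nA previous draft of your response was:\n"
        f"{original_message}\n"
        "It was flagged as potentially untrustworthy for this reason:\n"
        f"{explanation}\n"
        "Please generate a more reliable response."
    )
    revised = run_llm_step(prompt + caution)  # hypothetical: same LLM call with the appended caution
    revised_score = tlm.get_trustworthiness_score(prompt, revised)["trustworthiness_score"]

    # Keep whichever message TLM trusts more.
    return revised if revised_score > original_score else original_message
```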

Results of the automated message revision pipeline.
Results from running the agent with our automated message revision pipeline, where the customer service AI agent is powered by OpenAI's GPT-5 LLM.

The above results demonstrate that this fully autonomous approach can effectively boost the agent’s overall success rate in all Tau²-Bench domains. Below are some LLM outputs from three Tau²-Bench customer interactions, showing concrete successes thanks to automated message revision. In each example: the original Tau²-Bench agent ended up failing this interaction, but trust-score-powered automated message revision enables another copy of the same agent to succeed.

Sample Message Revision 1.
Example 1: A user wants to modify an order, and the agent has found the user's orders so far. The agent then attempts to find the specific order by looking up the details of each of their orders (left). The agent does this by outputting multiple `get_order_details()` tool calls at the same time, which violates the agent's system instructions specifying that tool calls must be done one at a time. TLM assigns a low trust score to this LLM message, offering the explanation pictured. Thus, our automated message revision pipeline catches this issue and only sends out the first tool call, which leads the agent to the correct order (right).
Sample Message Revision 2.
Example 2: A user wants to return their digital camera and provides their name, ZIP code, and order number. The agent has so far found the user's ID and order details. The agent then attempts to get the user's details, which is an unnecessary tool call since the agent's next action should depend on what the user wants to do with their order rather than needing additional details about the user (left). Since this LLM message is scored with low trustworthiness, our automated message revision pipeline catches this issue and replaces this message with one asking the user what they want to do with their order instead (right).
Sample Message Revision 3.
Example 3: A user wants to swap an item in their order for a coat, specifically a wool overcoat or peacoat. The agent then attempts to find details on a fleece jacket, which is not what the user asked for (left). Since this LLM message is scored with low trustworthiness, our automated message revision pipeline catches this issue and replaces this message with one informing the user that there are no coats available, but there are some fleece jackets (right).

Benchmarking Agents powered by other LLMs

In the results thus far, we left Tau²-Bench configurations at their provided defaults (including the Tau²-Bench customer simulator, which is powered by OpenAI’s GPT-4.1 LLM). Throughout all benchmarks, TLM is run with default settings, which use OpenAI’s GPT-4.1-mini LLM to keep latency and cost low. No other model was used for trust scoring here, although TLM can utilize any LLM for trust scoring (and detects AI mistakes/hallucinations with even greater precision when backed by a more powerful model).
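For reference, pointing TLM at a different underlying model is a small configuration change. The sketch below assumes TLM accepts an options dict with a model key; verify the exact options schema and supported model names against the TLM docs.

```python
from cleanlab_tlm import TLM

# Default settings (used throughout our benchmarks): scoring backed by a
# low-latency, low-cost model.
tlm_default = TLM()

# Assumed configuration for scoring with a more powerful model; check the
# TLM docs for the supported option names and model identifiers.
tlm_stronger = TLM(options={"model": "gpt-4.1"})
```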

Now we repeat the above benchmarks in an alternative setting, this time powering the customer service agent (and Tau²-Bench’s customer simulator) with OpenAI’s GPT-4.1-mini LLM. In this setting, both the customer support agent and the simulated users are more prone to mistakes (details in Appendix). Below are the corresponding results.

Results of the automated escalation pipeline.
Results from running the agent with our automated escalation pipeline, in alternative setting where the customer service AI agent is powered by OpenAI's GPT-4.1-mini LLM.
Results of the automated message revision pipeline.
Results from running the agent with our automated message revision pipeline, in alternative setting where the customer service AI agent is powered by OpenAI's GPT-4.1-mini LLM.

A Reliability Layer for your AI Agents

Beyond the Tau²-Bench gains presented here, Cleanlab is similarly improving customer-facing AI agents across enterprises, from financial institutions to government agencies. These teams also utilize trust scoring in offline evaluations to quickly spot an agent’s failure modes.

Reasoning, planning, tool use, and multi-step agentic execution expand AI capabilities, as well as the surface of possible mistakes. For any AI agent, Cleanlab’s trust scoring provides an additional layer of defense: don’t say/do it if you can’t trust it.

Try Cleanlab’s TLM API, and add real-time trust scores to any LLM/agentic application in minutes.

Appendix

Code to reproduce all of our results is available here.
