AI agents are common in customer support but still hallucinate, even with tools. One wrong answer about refunds or delays can permanently damage trust. This article demonstrates how Cleanlab scores each response in real time so you can stop bad AI agent outputs before they reach your users.
AI agents are becoming central to modern customer support, replacing traditional workflows with systems that can search, summarize, respond, and take action. But in production, an important issue remains: these agents often produce incorrect or misleading responses. Even in well-controlled environments, hallucinations still happen.
Agents are built on top of LLMs, which are fundamentally brittle and prone to unpredictable errors. Connecting an LLM to external data sources or giving it control over tool invocation does not eliminate this fragility.
When agents hallucinate or return misleading information, especially in high-stakes situations like customer support, trust erodes quickly. If your AI gives the wrong answer about refund eligibility, flight delays, or health advisories, your customer may never return. Worse, they may post the error publicly.
To avoid this, your AI agent needs more than intelligence. It needs trustworthiness.
Building a Reliable Customer Support Agent with LangGraph
This case study focuses on AI agents that are designed to gather and synthesize information before responding. Let’s consider a customer support AI agent built using LangGraph, a flexible open-source framework for managing agentic LLM workflows.
This type of AI agent typically uses tools to look up policies or fetch real-time data, then summarizes that information for the user. But when a tool returns incomplete or faulty data, or the LLM hallucinates, your agent may generate a confident-sounding but incorrect response.
Our corresponding tutorial provides a full implementation of this agent, along with the example runs shown below.
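For orientation, here is a minimal sketch of such a tool-calling agent in LangGraph. It is not the tutorial’s implementation: the get_discount_plan stub, the model choice, and the node names are illustrative placeholders for your real policy and flight-data tools.

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def get_discount_plan(plan_name: str) -> str:
    """Look up the details of a named discount plan (stubbed for illustration)."""
    return f"Returning details for the '{plan_name}' discount plan:"

tools = [get_discount_plan]
llm = ChatOpenAI(model="gpt-4o").bind_tools(tools)  # any tool-calling chat model works

def assistant(state: MessagesState):
    # The LLM decides whether to call a tool or answer the customer directly.
    return {"messages": [llm.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))
builder.add_edge(START, "assistant")
builder.add_conditional_edges("assistant", tools_condition)  # route to tools or finish
builder.add_edge("tools", "assistant")
graph = builder.compile()

# Example run:
# graph.invoke({"messages": [("user", "What are the details of the 'five' discount plan?")]})
```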
AI Agent Failure Examples in Customer Support
Failure example: Incomplete Tool Output
Customer query | AI tool calls | (Incorrect) AI response | What happened |
---|---|---|---|
What are the details of the “five” discount plan from Swiss Airlines? | Tool Calls: get_discount_plan Tool Message: Returning details for the 'five' discount plan: <end of output> | Here are the details for the “five” discount plan: … | The agent calls the right tool, but the response contains very little information. The LLM fills in the gaps by inventing a discount structure that sounds legitimate but is not based on actual data. |
Failure example: Misinterpreted Tool Result
Customer query | AI tool calls | (Incorrect) AI response | What happened |
---|---|---|---|
Is there a health advisory in Basel? | Tool Calls: get_travel_advisory Tool Message: {"level": "Level 2 – Exercise Increased Caution", "notes": "Travelers to Switzerland should be aware of petty crime and take precautions."} | There is a health advisory for Switzerland, including Basel: … | The travel advisory tool returns a general notice about petty crime in Switzerland. The LLM misreads the result and generates a response about nonexistent health risks. |
Failure example: Incorrect Assumptions
Customer query | AI tool calls | (Incorrect) AI response | What happened |
---|---|---|---|
What is my arrival time in their time zone? | Tool Calls: fetch_user_flight_information Tool Message: [{"ticket_no": "7240005432906569", "book_ref": "C46E9F", "flight_no": "LX0112", "departure_airport": "CDG", "arrival_airport": "BSL", "departure": "2025-05-30 01:36:24.169593-04:00", "arrival": "2025-05-30 03:06:24.169593-04:00"}] | Your flight LX0112 from Paris Charles de Gaulle (CDG) to EuroAirport Basel-Mulhouse-Freiburg (BSL) is scheduled to arrive at 03:06 AM (local time) on May 30, 2025. | The agent retrieves UTC timestamps for the flight but assumes the origin and destination are in the same time zone. This leads to a response that appears precise but is wrong by several hours. |
In each of these scenarios, the response may appear helpful, but it is flawed. The result is an agent that erodes trust instead of building it.
How Cleanlab Catches Agent Failures in Real Time
To solve these problems, Cleanlab’s detection capability provides a real-time trustworthiness scoring system. It analyzes the full user prompt, the AI’s response, relevant tool outputs, and all internal LLM calls made during the agent’s run to generate a trust score between 0 and 1. Higher scores indicate more reliable responses.
Unlike simple heuristics or basic LLM-as-a-judge models, Cleanlab combines statistical confidence estimation, model behavior profiling, and linguistic analysis into a single metric. It also offers natural-language explanations for why a response might be risky, helping your team trace issues quickly.
Cleanlab is benchmarked as the most accurate real-time method for detecting bad responses from any LLM. It works out of the box and requires neither labeled data nor model training infrastructure.
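As a rough sketch, scoring a single prompt/response pair with the cleanlab-tlm client might look like the following; the option and response-field names may vary by package version, and a Cleanlab API key is required. (In the agent setting, the tool outputs and internal LLM calls are included in what gets scored, as the tutorial shows.)

```python
from cleanlab_tlm import TLM

# Optionally request a natural-language explanation alongside the score.
tlm = TLM(options={"log": ["explanation"]})

prompt = "Is there a health advisory in Basel?"
response = "There is a health advisory for Switzerland, including Basel: ..."

result = tlm.get_trustworthiness_score(prompt, response)
print(result["trustworthiness_score"])            # float in [0, 1]; higher means more trustworthy
print(result.get("log", {}).get("explanation"))   # why the response looks risky
```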
Using the same customer support failure examples from above, let’s see how applying trust scores can flag incorrect responses.
Preventable example: Incomplete Tool Output
Customer query | AI tool calls | (Incorrect) AI response | Cleanlab Score (0-1) |
---|---|---|---|
What are the details of the “five” discount plan from Swiss Airlines? | Tool Calls: get_discount_plan Tool Message: Returning details for the 'five' discount plan: <end of output> | Here are the details for the “five” discount plan: … | 0.43. If the tool response lacks detail and the LLM hallucinates the rest, Cleanlab flags the response with a low trust score. This signals that the answer is untrustworthy and should be intercepted. |
Preventable example: Misinterpreted Tool Result
Customer query | AI tool calls | (Incorrect) AI response | Cleanlab Score (0-1) |
---|---|---|---|
Is there a health advisory in Basel? | Tool Calls: get_travel_advisory Tool Message: {"level": "Level 2 – Exercise Increased Caution", "notes": "Travelers to Switzerland should be aware of petty crime and take precautions."} | There is a health advisory for Switzerland, including Basel: … | 0.72. When the travel advisory about petty crime is misread as a health warning, Cleanlab detects the mismatch. The response receives a middling score, low enough to trigger fallback logic in sensitive domains. |
Preventable example: Incorrect Assumptions
Customer query | AI tool calls | (Incorrect) AI response | Cleanlab Score (0-1) |
---|---|---|---|
What is my arrival time in their time zone? | Tool Calls: fetch_user_flight_information Tool Message: [{"ticket_no": "7240005432906569", "book_ref": "C46E9F", "flight_no": "LX0112", "departure_airport": "CDG", "arrival_airport": "BSL", "departure": "2025-05-30 01:36:24.169593-04:00", "arrival": "2025-05-30 03:06:24.169593-04:00"}] | Your flight LX0112 from Paris Charles de Gaulle (CDG) to EuroAirport Basel-Mulhouse-Freiburg (BSL) is scheduled to arrive at 03:06 AM (local time) on May 30, 2025. | 0.37. An incorrect time zone conversion results in an arrival time that’s off by several hours. Cleanlab flags this with a very low score, ensuring the bad answer never reaches the user. |
These examples show how Cleanlab adds a vital trust layer between your AI system and your customers.
Using Fallbacks for Safer Output
When Cleanlab assigns a low trust score (for example, below 0.9), you can route the conversation to a fallback strategy; a short routing sketch follows the options listed below.
One common fallback is a generic but safe response, such as:
“Sorry, I cannot answer that based on the available information. Please try rephrasing your question or providing more details.”
Other fallback options include:
- Escalating the case to a human agent (e.g., via LangGraph’s interrupt() human-in-the-loop capability)
- Re-running the query with a revised prompt for the agent
- Logging the incident for future fine-tuning or tool improvements
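Here is that routing sketch. The 0.9 threshold and fallback message mirror the examples above, while the guarded_response helper is purely illustrative; in practice this check sits between the scored LLM response and whatever your agent sends to the user.

```python
# Illustrative threshold and fallback text; tune both for your application.
TRUST_THRESHOLD = 0.9
FALLBACK_ANSWER = (
    "Sorry, I cannot answer that based on the available information. "
    "Please try rephrasing your question or providing more details."
)

def guarded_response(response_text: str, trust_score: float) -> str:
    """Return the agent's answer only if its trust score clears the threshold."""
    if trust_score >= TRUST_THRESHOLD:
        return response_text
    # Low-trust path: fall back to the safe answer. You could instead escalate
    # to a human (e.g., LangGraph's interrupt()) or re-run with a revised prompt.
    return FALLBACK_ANSWER

# Example: a 0.43 score (the incomplete-tool-output case) triggers the fallback.
print(guarded_response("Here are the details for the 'five' discount plan: ...", 0.43))
```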
Easy Integration with LangGraph and Other Frameworks
Adding Cleanlab to your LangGraph agent only takes a few lines of code. You wrap your existing assistant node with Cleanlab’s TrustworthyAssistant, which automatically intercepts each LLM response and calculates its trust score.
```python
from cleanlab_tlm import TLM

# TrustworthyAssistant is the wrapper class from the tutorial notebook; it
# intercepts each LLM response and attaches a Cleanlab trust score.
tlm = TLM()

trustworthy_assistant = TrustworthyAssistant(
    assistant=existing_llm_node,  # your current assistant node (e.g., the assistant function sketched earlier)
    tools=tools,
    tlm=tlm,
)
```
You then update your graph to use the trustworthy_assistant node in place of your existing LLM node. That’s it. Your LangGraph agent now scores every LLM response in real time and stores the result in the graph state.
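Building on the earlier sketch, the swap might look roughly like this; the node names and graph shape are illustrative, and trustworthy_assistant and tools come from the snippets above (the tutorial defines TrustworthyAssistant as a standard LangGraph node callable).

```python
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

builder = StateGraph(MessagesState)
builder.add_node("assistant", trustworthy_assistant)  # wrapped node replaces the raw LLM node
builder.add_node("tools", ToolNode(tools))
builder.add_edge(START, "assistant")
builder.add_conditional_edges("assistant", tools_condition)
builder.add_edge("tools", "assistant")
graph = builder.compile()
```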
See our tutorial notebook for full implementation details.
Prevent Bad Responses from Reaching Users
Modern AI agents need to do more than generate responses. They must be safe, accurate, and reliable in real-world scenarios. Cleanlab provides this safeguard by scoring every AI response in real time and flagging those that are likely to be incorrect. It works without manual review or rule-based filters. When integrated with LangGraph or any other agentic framework, Cleanlab helps ensure that flawed responses never reach your users.