Trustworthy Language Model (TLM)

Reliability and explainability added to every LLM output. Smart-route LLM-automated responses and decisions using a trustworthiness score for every output. Cleanlab can also help your organization turn any LLM into a TLM.

The Problem with LLMs

Generative AI and Large Language Models (LLMs) are revolutionizing automation and data-driven decision-making. But there’s a catch: LLMs often produce "hallucinations", generating incorrect or nonsensical answers that can undermine your business.

The Solution: Add Trust to Every Response

Cleanlab's Trustworthy Language Model (TLM) scores the trustworthiness of every LLM response, letting you know which responses are reliable and which ones need extra scrutiny. TLM automatically detects incorrect LLM outputs in real-time—perfect for enterprise applications where unchecked hallucinations are unacceptable. Get started with our quick Python API tutorial, or read about use-cases and benchmarks.

Trustworthiness Scores.

Each response comes with a trustworthiness score, helping you reliably gauge the likelihood of hallucinations.

Higher accuracy.

Rigorous benchmarks show TLM consistently produces more accurate results than other LLMs like GPT-4 / GPT-4o and Claude.

Scalable API.

Designed to handle large datasets, TLM is suitable for most enterprise applications, including data extraction, tagging/labeling, Q&A (RAG), and more.

Unlock Reliable AI for Enterprise Applications.

TLM adds trust and reliability to any LLM use case.

Retrieval-Augmented Generation

TLM tells you which RAG responses are unreliable by providing a trustworthiness score for every RAG answer relative to a given question. Ensure users don't lose trust in your Q&A system and review untrustworthy responses to discover possible improvements.
Explore the tutorial

Chatbots

TLM informs you which LLM outputs you can use directly (refund, reply, auto-triage) and which LLM outputs you should flag or escalate for human review based on the corresponding trustworthiness score. With standard LLM APIs, you do not know which outputs to trust.
Explore the tutorial
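
As a rough illustration of this routing pattern, the sketch below thresholds the trustworthiness score to decide between auto-responding and escalating to a human. The `cleanlab_tlm` import, the client construction, and the 0.8 threshold are assumptions for illustration; only the `prompt()` method and its trustworthiness score come from the TLM API described in the FAQ below.

```python
from cleanlab_tlm import TLM  # assumed package/import; see the Python API tutorial

tlm = TLM()  # assumes an API key is configured in your environment

def handle_customer_message(message: str) -> None:
    out = tlm.prompt(message)  # returns the LLM response plus its trustworthiness score
    response, score = out["response"], out["trustworthiness_score"]

    if score >= 0.8:  # illustrative threshold, not a recommendation
        print("auto-reply:", response)
    else:
        print(f"escalate for human review (score={score:.2f}):", response)

handle_customer_message("I was charged twice for my order, can I get a refund?")
```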

Data Labeling

Save on human data annotation costs. TLM auto-labels data with high accuracy and reliable confidence scores. Let the LLM automatically handle the 99% of data where it is trustworthy, and manually review the remaining 1%.
Explore the tutorial
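
A minimal sketch of that split, assuming the same hypothetical `cleanlab_tlm` client and that `prompt()` accepts a batch of prompts (check the tutorial for the exact batching API): auto-accept labels whose trustworthiness score clears a threshold and queue the rest for annotators.

```python
from cleanlab_tlm import TLM  # assumed package/import

tlm = TLM()

reviews = ["I love this product", "Terrible support experience", "It arrived on time"]
prompts = [f"Label the sentiment of this review as positive or negative: {r}" for r in reviews]

# Assumption: prompt() accepts a list of prompts and returns one result per prompt.
results = tlm.prompt(prompts)

auto_labeled, needs_review = [], []
for review, res in zip(reviews, results):
    if res["trustworthiness_score"] >= 0.9:  # illustrative threshold
        auto_labeled.append((review, res["response"]))
    else:
        needs_review.append(review)  # route to human annotators

print(f"auto-labeled {len(auto_labeled)} items; {len(needs_review)} flagged for review")
```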

Data Extraction

TLM tells you which data auto-extracted from documents, databases, and transcripts is trustworthy and which should be double-checked. Transform raw unstructured information into structured data, with fewer errors and 90% less time spent reviewing outputs.
Explore the tutorial

Explain why an LLM response is untrustworthy.

Use TLM to not only catch hallucinations, but understand them as well via built-in explainability.

Proven Impact on Enterprise Deployments.

Cleanlab TLM can be integrated into your existing LLM-based workflows to improve accuracy and reliability. With a trustworthiness score for each response, you can manage the risks of LLM hallucinations and avoid costly errors.

"Cleanlab's TLM is the first viable answer to LLM hallucinations that I've seen. Our human-in-the-loop workflows are now 80% automated, saving us enormous time and resources. The downstream cost savings are substantial, with 10x to 100x ROl for many of our clients."

Steven Gawthorpe, PhD | Associate Director and Senior Data Scientist at BRG

FAQ

How does TLM work?

TLM scores our confidence that a response is good for a given request. In question-answering applications, "good" corresponds to whether the answer is correct; in general open-ended applications, it corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses. For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests seeking a correct answer.

TLM trustworthiness scores are a form of machine learning model uncertainty estimate. Machine learning models may produce uncertain outputs when given inputs that are fundamentally difficult (e.g. prompts that are vague or complex) or different from the model’s training data (e.g. prompts that are atypical or based on niche information/facts).

TLM comprehensively quantifies the uncertainty in responding to a given request via multiple operations:

  • self-reflection: a process in which the LLM is asked to explicitly rate the response and explicitly state how confident it is that the response is good.
  • probabilistic prediction: a process in which we consider the per-token probabilities assigned by an LLM as it generates a response based on the request (remember LLMs are trained to predict the probability of the next word/token in a sequence).
  • observed consistency: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or to a given response).

These operations produce various trustworthiness measures, which are combined into an overall trustworthiness score that captures all relevant types of uncertainty. For instance, vague requests that could be answered in many ways generally yield lower token probabilities. Self-reflection can detect error-prone reasoning steps made when processing complex requests as well as unsupported factual statements. Observed consistency addresses fragilities in generation like tokenization/sampling that can yield vastly different responses.
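
To make the combination concrete, here is a toy sketch of how several uncertainty signals could be folded into one score. This is a conceptual illustration only, not TLM's actual implementation (which is described in the research paper linked below); the signal values and weighting are made up.

```python
import math

def toy_trustworthiness(token_logprobs, self_reflection_score, consistency_score,
                        weights=(0.4, 0.3, 0.3)):
    """Toy combination of the three kinds of uncertainty signals described above.

    token_logprobs: per-token log-probabilities from the LLM while generating the response
    self_reflection_score: in [0, 1], the LLM's own rating of how good the response is
    consistency_score: in [0, 1], agreement between the response and alternative sampled responses
    """
    # Probabilistic prediction: average token probability of the generated response.
    avg_token_prob = math.exp(sum(token_logprobs) / len(token_logprobs))

    w1, w2, w3 = weights  # illustrative weights, not TLM's
    return w1 * avg_token_prob + w2 * self_reflection_score + w3 * consistency_score

# Example: fairly confident tokens, positive self-reflection, mostly consistent resamples.
print(toy_trustworthiness([-0.1, -0.3, -0.2], self_reflection_score=0.9, consistency_score=0.8))
```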

For more details, refer to our research paper published at ACL, the top venue for NLP and Generative AI research. Our publication rigorously describes certain foundational components of the TLM system.

Get in touch to learn more.

How well does TLM work?

Comprehensive benchmarks are provided in our blog. These reveal that TLM detects hallucinations/errors with significantly higher precision/recall than other methods, across many datasets, tasks, and LLMs.

Using the best or high quality_preset, TLM can additionally return more accurate LLM responses than the base LLM. Our benchmarks show that TLM can reduce the error rate (rate of incorrect answers) of GPT-4o by 27%, GPT-4o mini by 34%, GPT-4 by 10%, GPT-3.5 by 22%, and Claude 3 Haiku by 24%.
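
For reference, the preset is typically selected when constructing the client, roughly as below (a sketch assuming the hypothetical `cleanlab_tlm` Python client; see the documentation for the presets available in your version).

```python
from cleanlab_tlm import TLM  # assumed package/import

# Higher-quality presets trade extra latency/cost for more accurate responses and scores.
tlm_best = TLM(quality_preset="best")

out = tlm_best.prompt("What is the boiling point of water at sea level, in degrees Celsius?")
print(out["response"], out["trustworthiness_score"])
```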

Get in touch to learn more.

I am using LLM model ___, how can I use TLM?

Two primary ways to use TLM are the prompt() and get_trustworthiness_score() methods. The former can be used as a drop-in replacement for any standard LLM API, returning trustworthiness scores in addition to responses from one of TLM’s supported base LLM models. Here the response and trustworthiness score are both produced using the same LLM model.
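
For example, prompt() can stand in where you would otherwise call a standard completion API. A minimal sketch, assuming the `cleanlab_tlm` Python client; the import path, initialization, and API-key setup may differ in your environment.

```python
from cleanlab_tlm import TLM  # assumed package/import

tlm = TLM()  # assumes an API key is configured in your environment

out = tlm.prompt("Summarize the key obligations in this contract clause: ...")
print("response:", out["response"])
print("trustworthiness:", out["trustworthiness_score"])
```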

Alternatively, you can produce responses using any LLM, and just use TLM to subsequently score their trustworthiness.
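
In that mode you pass the original prompt together with your own LLM's response, sketched below with a placeholder call_my_llm() standing in for whatever LLM you already use; the exact return format of get_trustworthiness_score() is an assumption and may vary by version.

```python
from cleanlab_tlm import TLM  # assumed package/import

tlm = TLM()

def call_my_llm(prompt: str) -> str:
    # Placeholder for whatever LLM you already use (OpenAI, Anthropic, a private model, ...).
    return "Paris"

prompt = "What is the capital of France?"
response = call_my_llm(prompt)

# Score an externally produced response; check the docs for the exact return format.
score = tlm.get_trustworthiness_score(prompt, response)
print("trustworthiness:", score)
```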

If you would like to both produce responses and score their trustworthiness using your own custom (private) LLM, get in touch regarding our Enterprise plan.

What about latency-sensitive applications (like Chat)?

TLM is used in many companies’ chat applications, scoring the trustworthiness of every LLM output in real time.

See our FAQ and Advanced Tutorial for configurations you can change to significantly improve latency.

Instead of using TLM to produce responses, you can stream in responses from your own LLM and use TLM.get_trustworthiness_score() to subsequently stream in the corresponding trustworthiness score.

Get in touch to learn more.

Do you offer private deployments in VPC?

Yes, TLM can be deployed in your company’s own cloud such that all data remains within your private infrastructure. All major cloud providers and LLM models are supported.

Get in touch to learn more.

Turn your LLM into a TLM today.
It’s free to try, with no credit card required.