Trustworthy Language Model (TLM)
The Problem with LLMs
Generative AI and Large Language Models (LLMs) are revolutionizing automation and data-driven decision-making. But there’s a catch: LLMs often produce "hallucinations", generating incorrect or nonsensical answers that can undermine your business.
The Solution: Add Trust to Every Response
Cleanlab's Trustworthy Language Model (TLM) scores the trustworthiness of every LLM response, letting you know which responses are reliable and which ones need extra scrutiny. TLM automatically detects incorrect LLM outputs in real time, making it well suited to enterprise applications where unchecked hallucinations are unacceptable. Get started with our quick Python API tutorial, or read about use cases and benchmarks.
Trustworthiness scores.
Higher accuracy.
Scalable API.
Unlock Reliable AI for Enterprise Applications.
TLM adds trust and reliability to any LLM use case.
Retrieval-Augmented Generation
Chatbots
Data Labeling
Data Extraction
Explain why an LLM response is untrustworthy.
Proven impact on enterprise deployment.
Cleanlab TLM can be integrated into your existing LLM-based workflows to improve accuracy and reliability. With a trustworthiness score for each response, you can manage the risks of LLM hallucinations and avoid costly errors.
"Cleanlab's TLM is the first viable answer to LLM hallucinations that I've seen. Our human-in-the-loop workflows are now 80% automated, saving us enormous time and resources. The downstream cost savings are substantial, with 10x to 100x ROl for many of our clients."
Steven Gawthorpe, PhD | Associate Director and Senior Data Scientist at BRG
FAQ
How does TLM work?
TLM scores our confidence that a response is good for a given request. In question-answering applications, good corresponds to whether the answer is correct. In general open-ended applications, good corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses. For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests that are questions seeking a correct answer.
TLM trustworthiness scores are a form of machine learning model uncertainty estimate. Machine learning models may produce uncertain outputs when given inputs that are fundamentally difficult (e.g., prompts that are vague or complex) or different from the model's training data (e.g., prompts that are atypical or based on niche information/facts).
TLM comprehensively quantifies the uncertainty in responding to a given request via multiple operations:
- self-reflection: a process in which the LLM is asked to explicitly rate the response and state how confident it is that the response is good.
- probabilistic prediction: a process in which we consider the per-token probabilities assigned by an LLM as it generates a response based on the request (recall that LLMs are trained to predict the probability of the next word/token in a sequence).
- observed consistency: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or to a given response).
These operations produce various trustworthiness measures, which are combined into an overall trustworthiness score that captures all relevant types of uncertainty. For instance, vague requests that could be answered in many ways generally yield lower token probabilities. Self-reflection can detect error-prone reasoning steps made when processing complex requests as well as unsupported factual statements. Observed consistency addresses fragilities in generation like tokenization/sampling that can yield vastly different responses.
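To make this concrete, here is a purely illustrative sketch (not Cleanlab's actual implementation) of how three uncertainty signals of this kind could be combined into a single score. The helper callables ask_llm, token_logprobs, and responses_agree are hypothetical stand-ins you would supply for your own LLM:

```python
import math
from typing import Callable, List

def trustworthiness_sketch(
    prompt: str,
    response: str,
    ask_llm: Callable[[str], str],                      # hypothetical: call an LLM, return its text output
    token_logprobs: Callable[[str, str], List[float]],  # hypothetical: per-token log-probs of response given prompt
    responses_agree: Callable[[str, str], bool],        # hypothetical: do two responses agree?
    n_samples: int = 4,
) -> float:
    # 1) Self-reflection: ask the LLM to rate how likely its answer is to be correct.
    rating = ask_llm(
        f"On a scale from 0 to 1, how likely is this answer to be correct?\n"
        f"Question: {prompt}\nAnswer: {response}\nReply with a number only."
    )
    reflection = float(rating)

    # 2) Probabilistic prediction: average per-token probability of the response.
    logprobs = token_logprobs(prompt, response)
    avg_token_prob = math.exp(sum(logprobs) / len(logprobs))

    # 3) Observed consistency: sample alternative responses and check agreement.
    samples = [ask_llm(prompt) for _ in range(n_samples)]
    consistency = sum(responses_agree(response, s) for s in samples) / n_samples

    # Naive aggregation (simple average); TLM's actual aggregation is more sophisticated.
    return (reflection + avg_token_prob + consistency) / 3.0
```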
For more details, refer to our research paper published at ACL, the top venue for NLP and Generative AI research. Our publication rigorously describes certain foundational components of the TLM system.
Get in touch to learn more.
How well does TLM work?
Comprehensive benchmarks are provided in our blog. These reveal that TLM detects hallucinations/errors with significantly higher precision/recall than other methods, across many datasets, tasks, and LLM models.
Using the best or high quality_preset, TLM can additionally return more accurate LLM responses than the base LLM. Our benchmarks show that TLM can reduce the error rate (incorrect answers) of GPT-4o by 27%, of GPT-4o mini by 34%, of GPT-4 by 10%, of GPT-3.5 by 22%, and of Claude 3 Haiku by 24%.
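For illustration, here is a minimal usage sketch. It assumes the cleanlab-tlm Python client; the prompt() method and quality_preset option are described above, but the exact import path and return fields may differ in the client version you have installed:

```python
# Sketch assuming the cleanlab-tlm Python client; adjust the import and
# return-field names to match the client you have installed.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="best")  # "best"/"high" trade extra latency for more accurate responses

result = tlm.prompt("What year did the French Revolution begin?")
print(result["response"])               # the answer from TLM's base LLM
print(result["trustworthiness_score"])  # higher score = more reliable response
```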
Get in touch to learn more.
I am using LLM model ___. How can I use TLM?
Two primary ways to use TLM are the prompt() and get_trustworthiness_score() methods. The former can be used as a drop-in replacement for any standard LLM API, returning trustworthiness scores in addition to responses from one of TLM's supported base LLM models. Here the response and trustworthiness score are both produced using the same LLM model.
Alternatively, you can produce responses using any LLM, and just use TLM to subsequently score their trustworthiness.
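For illustration, a sketch of that workflow under the same assumptions about the cleanlab-tlm client; the hard-coded response stands in for output from your own LLM, and the 0.7 threshold is an arbitrary example:

```python
from cleanlab_tlm import TLM  # assumed client, as above

tlm = TLM()

prompt = "Summarize the customer's complaint in one sentence."
# Response produced by your own LLM (hard-coded here for illustration):
response = "The customer is upset that their order arrived two weeks late."

result = tlm.get_trustworthiness_score(prompt, response)
if result["trustworthiness_score"] < 0.7:  # example threshold; tune for your use case
    print("Low trustworthiness -- route this response for human review.")
```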
If you would like to both produce responses and score their trustworthiness using your own custom (private) LLM, get in touch regarding our Enterprise plan.
What about latency-sensitive applications (like Chat)?
TLM is used in many companies’ chat applications, scoring the trustworthiness of every LLM output in real time.
See our FAQ and Advanced Tutorial for configurations you can change to significantly improve latency.
Instead of using TLM to produce responses, you can alternatively stream in responses from your own LLM, and use TLM.get_trustworthiness_score() to subsequently stream in the corresponding trustworthiness score.
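A sketch of that pattern (the stream_from_your_llm generator is a hypothetical stand-in for your own LLM's streaming client; the cleanlab-tlm import and return fields are assumptions, as above):

```python
from cleanlab_tlm import TLM  # assumed client, as above

tlm = TLM()

def stream_from_your_llm(prompt):
    # Stand-in for your own LLM's streaming API: yield tokens as they are generated.
    yield from ["The", " French", " Revolution", " began", " in", " 1789."]

prompt = "When did the French Revolution begin?"
chunks = []
for token in stream_from_your_llm(prompt):
    print(token, end="", flush=True)  # show the reply to the user with no added latency
    chunks.append(token)
response = "".join(chunks)

# Score once the full response is available, without blocking the chat stream.
result = tlm.get_trustworthiness_score(prompt, response)
print("\ntrustworthiness:", result["trustworthiness_score"])
```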
Get in touch to learn more.
Do you offer private deployments in VPC?
Yes, TLM can be deployed in your company’s own cloud such that all data remains within your private infrastructure. All major cloud providers and LLM models are supported.
Get in touch to learn more.