We’re excited to launch Cleanlab’s TLM Lite, a version of the Trustworthy Language Model (TLM) that combines high-quality LLM responses with efficient trustworthiness evaluation. By using different LLMs for generating responses and for trust scoring, TLM Lite lets you leverage advanced models while speeding up evaluations and keeping costs manageable, so you can deploy reliable generative AI for use cases that were previously too costly or too unreliable. Benchmarks show that TLM Lite produces more effective trustworthiness scores than other automated LLM evaluation approaches like self-evaluation.
The Challenge of LLM Evaluation: Balancing Quality and Efficiency
Generating high-quality responses with advanced large language models (LLMs) can come at a high computational and financial cost. Evaluating the trustworthiness of these responses often requires additional resources, making the entire process even more cumbersome and expensive. These issues can hinder the practical deployment of LLMs in many real-world applications.
TLM Lite addresses this challenge by streamlining the use of different models in the response generation and trustworthiness evaluation processes. You can leverage powerful LLMs to produce better responses, while employing smaller, more efficient models to score the trustworthiness of these responses. This hybrid approach gives you the best of both worlds: high-quality responses and cost-effective (while still reliable) trust evaluations.
Using TLM Lite
Similar to Cleanlab’s TLM, TLM Lite provides a .prompt() method that returns an LLM response along with a trustworthiness score. TLM Lite can be used in the same manner as TLM, with the added flexibility that users can specify separate models: a response model that generates the response for each prompt, and a scoring model that handles the trust scoring.
We recommend using a stronger model, such as GPT-4 or GPT-4o, for generating responses to ensure high-quality outputs, while using a smaller, more efficient model such as GPT-4o mini for trustworthiness scoring to minimize costs and evaluation times. However, you can use any combination of models you like. The generally-available version of TLM Lite lets you choose among a number of popular base LLMs from OpenAI and Anthropic, but TLM Lite can augment any LLM given only black-box access to its API. Enterprise customers can even add trustworthiness to a custom fine-tuned LLM – just contact us.
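For illustration, here is a minimal sketch of this usage in Python. The client setup and constructor/argument names shown here (`Studio`, `TLMLite`, `response_model`) are illustrative assumptions; consult the TLM Lite documentation for the exact interface in your version of the Cleanlab client.

```python
from cleanlab_studio import Studio

# Initialize the Cleanlab Studio client (replace with your own API key).
studio = Studio("<YOUR_API_KEY>")

# TLM Lite: a stronger model generates the response, a smaller model scores it.
# Note: the exact constructor/argument names (TLMLite, response_model) are
# illustrative; check the TLM Lite docs for your client version.
tlm_lite = studio.TLMLite(response_model="gpt-4o")

out = tlm_lite.prompt("What is the third month of the year alphabetically?")
print(out["response"])               # answer generated by the response model
print(out["trustworthiness_score"])  # score from the smaller scoring model
```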
Evaluating TLM Lite Performance
We evaluate the performance of TLM Lite by benchmarking both the regular TLM (using GPT-4o for all tasks) and TLM Lite (using GPT-4o for response generation and GPT-4o mini for trustworthiness scoring). Since the responses from both TLM and TLM Lite are generated by GPT-4o, our focus is solely on evaluating the reliability of each method’s trustworthiness scores. We want to see how many wrong LLM responses we can catch under a limited review budget by prioritizing review via trustworthiness scores.
In this benchmark, we also compare against another popular approach for estimating the confidence of LLM responses: Self-Eval, which asks the LLM to evaluate its own output and rate its confidence on a scale of 1-5. This is done in a subsequent request to the LLM (details in the Appendix), also using the smaller GPT-4o mini model.
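As a rough illustration of the Self-Eval baseline, the follow-up request might look like the sketch below. The prompt wording here is illustrative only (the exact template used in our benchmark is given in the Appendix), and it uses the standard OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def self_eval_confidence(question: str, answer: str) -> str:
    """Ask the LLM to rate its own answer on a 1-5 scale (illustrative prompt
    wording; the actual benchmark template is in the Appendix)."""
    rating_prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "How confident are you that the proposed answer is correct? "
        "Reply with a single integer from 1 (not confident) to 5 (very confident)."
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": rating_prompt}],
    )
    return completion.choices[0].message.content
```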
Benchmark Datasets
Our study focuses on Q&A settings. Unlike other LLM benchmarks, we never measure benchmark performance using LLM-based evaluations. All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. We consider these popular Q&A datasets:
- TriviaQA: Open-domain trivia questions.
- ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
- SVAMP: Elementary-level math word problems.
- GSM8k: Grade school math problems.
- Diagnosis: Diagnosing medical conditions based on symptom descriptions from the patient.
Benchmark Results
We evaluate the aforementioned approaches for estimating trustworthiness scores for each LLM response (again using GPT-4o as the response-generating LLM): TLM, TLM Lite, and Self-Eval.
We evaluate the utility of these trustworthiness scores via the probability that LLM response #1 receives a higher trustworthiness score than LLM response #2, where the former is randomly selected from the subset of model responses that were correct, and the latter from the subset that were incorrect. This evaluation metric, widely used to assess diagnostic scores, is known as the Area under the Receiver Operating Characteristic Curve (AUROC). The following table reports the AUROC achieved by each trustworthiness scoring method on each dataset:
| Dataset | Self-Eval | TLM | TLM Lite |
|---|---|---|---|
| TriviaQA | 0.610 | 0.817 | 0.746 |
| ARC | 0.692 | 0.850 | 0.771 |
| SVAMP | 0.529 | 0.788 | 0.800 |
| GSM8k | 0.424 | 0.659 | 0.646 |
| Diagnosis | 0.580 | 0.722 | 0.668 |
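Concretely, the AUROC values above can be computed from per-response correctness labels and trustworthiness scores, for example with scikit-learn (the data here are toy values for illustration):

```python
from sklearn.metrics import roc_auc_score

# is_correct[i] = 1 if the LLM response matched the ground-truth answer, else 0.
# trust_scores[i] = trustworthiness score assigned to that response (toy values).
is_correct   = [1, 1, 0, 1, 0, 0, 1, 0]
trust_scores = [0.92, 0.85, 0.40, 0.77, 0.55, 0.30, 0.66, 0.48]

# AUROC = probability that a randomly chosen correct response receives a
# higher trustworthiness score than a randomly chosen incorrect one.
print(roc_auc_score(is_correct, trust_scores))
```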
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls within the top 80% of scores for each dataset (all based on GPT-4o mini):
| Dataset | Self-Eval | TLM | TLM Lite | Baseline Accuracy (no filtering) |
|---|---|---|---|---|
| TriviaQA | 88.8% | 94.8% | 92.3% | 88.2% |
| ARC | 97.8% | 99.2% | 98.5% | 96.6% |
| SVAMP | 95.5% | 97.7% | 97.6% | 95.0% |
| GSM8k | 72.9% | 77.0% | 76.1% | 74.1% |
| Diagnosis | 72.8% | 75.8% | 72.8% | 68.9% |
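The filtering used in this table amounts to keeping only responses whose trustworthiness score exceeds the 20th percentile of scores on the dataset, i.e. abstaining on (or escalating for review) the least trustworthy 20% of responses. A minimal sketch of this computation, again with toy values:

```python
import numpy as np

# One entry per benchmark example (toy values for illustration).
trust_scores = np.array([0.92, 0.85, 0.40, 0.77, 0.55, 0.30, 0.66, 0.48])
is_correct   = np.array([1, 1, 0, 1, 0, 0, 1, 0])

# Keep responses in the top 80% of trustworthiness scores (drop the bottom 20%).
threshold = np.percentile(trust_scores, 20)
kept = trust_scores >= threshold

baseline_accuracy = is_correct.mean()
filtered_accuracy = is_correct[kept].mean()
print(f"baseline: {baseline_accuracy:.1%}, after filtering: {filtered_accuracy:.1%}")
```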
The results above show that while TLM Lite does not perform as well as TLM (due to using a smaller scoring model), it still provides far more effective trustworthiness scores than the popular self-evaluation method.
Conclusion
TLM Lite builds on the foundation of TLM and introduces a more flexible, cost-effective, and efficient approach to managing the trade-off between response quality and trustworthiness. By allowing users to customize the response and trust scoring models separately, TLM Lite optimizes the performance and cost-efficiency of obtaining trustworthy LLM responses, which enables rapid, scalable deployment of generative AI solutions across various applications.
Get started with the TLM Lite API for free.
For ultra latency-sensitive applications, consider the following approach instead of TLM Lite: Stream in responses from your own LLM, and use TLM.get_trustworthiness_score() to subsequently stream in the corresponding trustworthiness score.
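A minimal sketch of this pattern is shown below, assuming responses come from your own OpenAI LLM and scoring uses the Cleanlab TLM client. The TLM.get_trustworthiness_score() method name comes from the text above; other details (client setup, argument order, a non-streaming call for brevity) are illustrative assumptions.

```python
from openai import OpenAI
from cleanlab_studio import Studio

openai_client = OpenAI()              # your own response-generating LLM
tlm = Studio("<YOUR_API_KEY>").TLM()  # Cleanlab TLM used only for scoring

prompt = "What is the boiling point of water at sea level, in Celsius?"

# 1) Generate the response from your own LLM (a streaming call would surface
#    tokens to the user immediately; shown non-streaming here for brevity).
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# 2) Score the already-generated response afterwards (argument order assumed).
score = tlm.get_trustworthiness_score(prompt, response)
print(response, score)
```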
Appendix
