We’re excited to launch Cleanlab’s TLM Lite, a version of the Trustworthy Language Model (TLM) that combines high-quality LLM responses with efficient trustworthiness evaluation. By using different LLMs for generating responses and for trust scoring, TLM Lite lets you leverage advanced models while speeding up evaluations and keeping costs manageable, so you can deploy reliable generative AI for use cases that were previously too costly or too low-quality. Benchmarks show that TLM Lite produces more effective trustworthiness scores than other automated LLM evaluation approaches such as self-evaluation.
The Challenge of LLM Evaluation: Balancing Quality and Efficiency
Generating high-quality responses with advanced large language models (LLMs) comes at a high computational and financial cost. Evaluating the trustworthiness of those responses requires additional resources on top of that, making the entire process even more cumbersome and expensive. These issues can hinder the practical deployment of LLMs in many real-world applications.
TLM Lite addresses this challenge by streamlining the use of different models in the response generation and trustworthiness evaluation processes. You can leverage powerful LLMs to produce better responses, while employing smaller, more efficient models to score the trustworthiness of these responses. This hybrid approach gives you the best of both worlds: high-quality responses and cost-effective (while still reliable) trust evaluations.
Using TLM Lite
Like Cleanlab’s TLM, TLM Lite provides a .prompt() method that returns an LLM response along with a trustworthiness score. TLM Lite is used in much the same way as TLM, with the added flexibility that you can specify a separate response model and scoring model: one model generates the response for a given prompt, and the other handles the trust scoring.
We recommend using a stronger model, such as GPT-4 or GPT-4o, to generate responses and ensure high-quality outputs, while using a smaller, more efficient model such as GPT-4o mini for trustworthiness scoring to minimize costs and evaluation times. However, you can use any combination of models you like. The generally-available version of TLM Lite lets you choose between a number of popular base LLMs from OpenAI and Anthropic, but TLM Lite can augment any LLM with only black-box access to the LLM API. Enterprise customers can even add trustworthiness to a custom fine-tuned LLM – just contact us.
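For illustration, here is a minimal sketch of what this might look like with the Python client. The constructor and argument names shown here (TLMLite, response_model) are illustrative assumptions; consult the TLM Lite documentation for the exact current API.

```python
# Minimal sketch of TLM Lite usage. The constructor and argument names
# (TLMLite, response_model) are illustrative assumptions -- consult the
# TLM Lite documentation for the exact current API.
from cleanlab_studio import Studio

studio = Studio("<YOUR_CLEANLAB_API_KEY>")

# A stronger model generates responses; a smaller model handles trust scoring.
tlm_lite = studio.TLMLite(response_model="gpt-4o")

output = tlm_lite.prompt("What is the boiling point of nitrogen at 1 atm?")
print(output["response"])               # generated by the larger response model
print(output["trustworthiness_score"])  # computed by the smaller scoring model
```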
Evaluating TLM Lite Performance
We evaluate the performance of TLM Lite by benchmarking both the regular TLM (using GPT-4o for all tasks) and TLM Lite (using GPT-4o for response generation and GPT-4o mini for trustworthiness scoring). Since the responses from both TLM and TLM Lite are generated by GPT-4o, our focus is solely on evaluating the reliability of each model’s trustworthiness scores. We want to see how many wrong LLM responses we can catch under a limited review budget by prioritizing via trustworthiness scores.
In this benchmark, we also compare against another popular approach to estimate the confidence of LLM responses, Self-Eval, which asks the LLM to evaluate its own output and rate its confidence on a scale of 1-5. This is done in a subsequent request to the LLM (details in Appendix), also using the smaller GPT-4o mini model.
Benchmark Datasets
Our study focuses on Q&A settings. Unlike other LLM benchmarks, we never measure benchmark performance using LLM-based evaluations. All of our benchmarks involve questions with a single correct answer, and benchmark performance is based on whether or not the LLM response matches this known ground-truth answer. We consider these popular Q&A datasets:
- TriviaQA: Open-domain trivia questions.
- ARC: Grade school multiple-choice questions (we consider the “Challenge Test” subset).
- SVAMP: Elementary-level math word problems.
- GSM8k: Grade school math problems.
- Diagnosis: Diagnosing medical conditions based on symptom descriptions from the patient.
Benchmark Results
We evaluate the aforementioned approaches for estimating trustworthiness scores for each LLM response (again using GPT-4o as the response-generating LLM): TLM, TLM Lite, and Self-Eval.
We evaluate the utility of these trustworthiness scores via the probability that LLM response #1 receives a higher trustworthiness score than LLM response #2, where the former is randomly selected from the subset of model responses that were correct and the latter from the subset that were incorrect. Widely used to assess diagnostic scores, this evaluation metric is known as the Area Under the Receiver Operating Characteristic curve (AUROC). The following table reports the AUROC achieved by each trustworthiness scoring method on each dataset:
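As a point of reference, an AUROC of this kind can be computed from per-response correctness labels and trustworthiness scores as sketched below (illustrative example data using scikit-learn, not our benchmarking code):

```python
# Generic sketch: AUROC of trustworthiness scores for separating
# correct from incorrect LLM responses (illustrative data only).
from sklearn.metrics import roc_auc_score

# 1 if the LLM response matched the ground-truth answer, else 0
is_correct = [1, 0, 1, 1, 0]
# trustworthiness score assigned to each response
trust_scores = [0.92, 0.35, 0.81, 0.67, 0.48]

# AUROC = probability that a randomly chosen correct response
# receives a higher trustworthiness score than an incorrect one.
auroc = roc_auc_score(is_correct, trust_scores)
print(f"AUROC: {auroc:.3f}")
```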
| Dataset | Self-Eval | TLM | TLM Lite |
|---|---|---|---|
| TriviaQA | 0.610 | 0.817 | 0.746 |
| ARC | 0.692 | 0.850 | 0.771 |
| SVAMP | 0.529 | 0.788 | 0.800 |
| GSM8k | 0.424 | 0.659 | 0.646 |
| Diagnosis | 0.580 | 0.722 | 0.668 |
To reflect expected performance in applications where our AI can abstain from responding, the following table reports the accuracy of LLM responses whose associated trustworthiness score falls within the top 80% of scores for each dataset (all based on GPT-4o mini):
| Dataset | Self-Eval | TLM | TLM Lite | Baseline Accuracy (no filtering) |
|---|---|---|---|---|
| TriviaQA | 88.8% | 94.8% | 92.3% | 88.2% |
| ARC | 97.8% | 99.2% | 98.5% | 96.6% |
| SVAMP | 95.5% | 97.7% | 97.6% | 95.0% |
| GSM8k | 72.9% | 77.0% | 76.1% | 74.1% |
| Diagnosis | 72.8% | 75.8% | 72.8% | 68.9% |
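For reference, the filtered-accuracy metric reported in the table above can be computed along these lines (a generic sketch with a hypothetical helper name, not our benchmarking code):

```python
# Generic sketch: accuracy among responses whose trustworthiness score
# falls in the top 80% for a dataset (i.e., the least-trusted 20% of
# responses are withheld for review instead of being served).
import numpy as np

def top_80_percent_accuracy(is_correct: np.ndarray, trust_scores: np.ndarray) -> float:
    threshold = np.quantile(trust_scores, 0.20)  # cutoff below which responses are withheld
    kept = trust_scores >= threshold
    return float(is_correct[kept].mean())

# Example call on toy data:
acc = top_80_percent_accuracy(
    np.array([1, 0, 1, 1, 0]), np.array([0.9, 0.3, 0.8, 0.7, 0.5])
)
```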
We see in the results above that while TLM Lite does not perform as well as TLM (due to using a smaller scoring model), it still provides far more reliable trustworthiness scores than the popular self-evaluation method.
Conclusion
TLM Lite builds on the foundation of TLM and introduces a more flexible, cost-effective, and efficient approach to managing the trade-off between response quality and trustworthiness. By allowing users to customize the response and trust scoring models separately, TLM Lite optimizes the performance and cost-efficiency of obtaining trustworthy LLM responses, which enables rapid, scalable deployment of generative AI solutions across various applications.
Get started with the TLM Lite API for free.
For extremely latency-sensitive applications, consider the following approach instead of TLM Lite: stream in responses from your own LLM, and use TLM.get_trustworthiness_score() to subsequently stream in the corresponding trustworthiness score.
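A rough sketch of this pattern is shown below. The OpenAI streaming client stands in for your own LLM here, and the exact return type of TLM.get_trustworthiness_score() may differ across client versions:

```python
# Sketch: stream a response from your own LLM, then score it with TLM.
# The OpenAI streaming call is illustrative -- substitute your own LLM.
from cleanlab_studio import Studio
from openai import OpenAI

studio = Studio("<YOUR_CLEANLAB_API_KEY>")
tlm = studio.TLM()
client = OpenAI()

prompt = "Summarize our refund policy in one sentence."

# Stream the response tokens to the user as they arrive.
chunks = []
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        delta = chunk.choices[0].delta.content or ""
        chunks.append(delta)
        print(delta, end="", flush=True)
response = "".join(chunks)

# Once the full response is available, score it without re-generating it.
# (Depending on client version, this may return a float or a dict.)
score = tlm.get_trustworthiness_score(prompt, response)
print(f"\nTrustworthiness: {score}")
```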
Appendix
Additional benchmarking details.
The prompt used for the Self-Eval method, sent as a separate request asking the LLM to evaluate its previous response, was:
Question: {question}
Answer: {LLM response}
Evaluate how confident you are that the given Answer is a good and accurate response to the Question.
Please assign a Score using the following 5-point scale:
1: You are not confident that the Answer addresses the Question at all, the Answer may be entirely off-topic or irrelevant to the Question.
2: You have low confidence that the Answer addresses the Question, there are doubts and uncertainties about the accuracy of the Answer.
3: You have moderate confidence that the Answer addresses the Question, the Answer seems reasonably accurate and on-topic, but with room for improvement.
4: You have high confidence that the Answer addresses the Question, the Answer provides accurate information that addresses most of the Question.
5: You are extremely confident that the Answer addresses the Question, the Answer is highly accurate, relevant, and effectively addresses the Question in its entirety.
The output should strictly use the following template: Explanation: [provide a brief reasoning you used to derive the rating Score] and then write ‘Score: <rating>’ on the last line.
For this Self-Eval method, we also tried having the LLM report confidence in its original answer on more continuous numeric scales (e.g., 1-10 or 1-100), but the resulting scores performed worse.
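For reference, a bare-bones implementation of this Self-Eval baseline might look like the following sketch, which calls GPT-4o mini via the OpenAI client with the prompt shown above and parses the reported Score (our actual benchmarking code may differ):

```python
# Sketch of the Self-Eval baseline: ask the smaller model to rate its
# confidence in a previously generated answer on the 1-5 scale above.
import re
from openai import OpenAI

client = OpenAI()

SELF_EVAL_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Evaluate how confident you are that the given Answer is a good and accurate "
    "response to the Question.\n"
    "Please assign a Score using the following 5-point scale:\n"
    "...\n"  # full 1-5 scale definitions as listed in the prompt above
    "The output should strictly use the following template: Explanation: [provide a "
    "brief reasoning you used to derive the rating Score] and then write "
    "'Score: <rating>' on the last line."
)

def self_eval_score(question: str, answer: str) -> int | None:
    """Return the 1-5 self-reported confidence score, or None if unparseable."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": SELF_EVAL_TEMPLATE.format(question=question, answer=answer),
        }],
    )
    text = completion.choices[0].message.content or ""
    match = re.search(r"Score:\s*(\d)", text)
    return int(match.group(1)) if match else None
```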
Prompts used for each benchmark dataset.
Throughout all benchmarks, TLM and the baseline LLM are prompted using the exact same prompts. In competitive AI research, these types of benchmarks are run with sophisticated many-shot chain-of-thought prompts to maximize raw LLM accuracy (example). While such complex prompting helps foundation model providers show that their new model is better than everybody else’s, it does not reflect the types of queries from typical users. Our benchmarks here use simple prompts to better reflect how LLMs are used to drive real-world business value. We are not focused on optimizing prompts to maximize LLM accuracy; instead, we focus on studying the benefits of adding TLM technology to any LLM.
The specific prompts we used to run our LLMs on each dataset are listed below.
TriviaQA:
{question text}
Therefore, the answer is
ARC:
{multiple-choice question text}
Please restrict your answer to one letter from A to D and nothing else.
SVAMP:
{question text}
Therefore, the answer (arabic numerals) is:
GSM8k:
{question text}
Therefore, the answer (arabic numerals) is:
Diagnosis:
You are a doctor looking at a patient’s symptoms. Classify the Symptoms into a single Diagnosis that best represents them. The list of available Diagnosis is: [cervical spondylosis, impetigo, urinary tract infection, arthritis, dengue, common cold, drug reaction, fungal infection, malaria, allergy, bronchial asthma, varicose veins, migraine, hypertension, gastroesophageal reflux disease, pneumonia, psoriasis, diabetes, jaundice, chicken pox, typhoid, peptic ulcer disease]. Consider why the Symptoms reflect a specific Diagnosis. In your response, respond with only a single Diagnosis out of the list. Do not write anything else.
Symptoms: {text from patient description}