
Benchmarking Hallucination Detection Methods in RAG

September 30, 2024
  • Hui Wen Goh
  • Nelson Auner
  • Aditya Thyagarajan
  • Jonas Mueller

Unchecked hallucination remains a big problem in today’s Retrieval-Augmented Generation applications. This study evaluates popular hallucination detectors across 4 public RAG datasets. Using precision/recall, we report how well methods like RAGAS, G-eval, LLM self-evaluation, the DeepEval hallucination metric, and the Trustworthy Language Model are able to automatically flag incorrect LLM responses.

The Problem: Hallucinations and Errors in RAG Systems

Large Language Models (LLMs) are known to hallucinate incorrect answers when asked questions not well-supported within their training data. Retrieval-Augmented Generation (RAG) systems mitigate this by augmenting the LLM with the ability to retrieve context and information from a specific knowledge database. While organizations are rapidly adopting RAG to pair the power of LLMs with their own proprietary data, hallucinations and logical errors remain a major problem. In one highly publicized case, a major airline lost a court case after their RAG system hallucinated important details of their refund policy.

To understand this issue, let’s first revisit how a RAG system works. When a user asks a question (“Is this refund eligible?”), the retrieval component searches the knowledge database for relevant information needed to respond accurately. The most relevant search results are formatted into a context, which is fed along with the user’s question into an LLM that generates the response presented to the user.

Because enterprise RAG systems are often complex, the final response might be incorrect for several reasons:

  1. As machine learning models, LLMs are fundamentally brittle and prone to hallucination. Even when the retrieved context contains the correct answer within it, the LLM may fail to generate an accurate response, especially if synthesizing the response requires reasoning across different facts within the context.
  2. The retrieved context may not contain the information required to respond accurately, due to suboptimal search, poor document chunking/formatting, or the absence of this information within the knowledge database. In such cases, the LLM may still attempt to answer the question and hallucinate an incorrect response. For instance, when asked to recommend a product without being given relevant product information, the LLM may recommend one from its training data - possibly a competitor’s!

While some use the term hallucination to refer only to specific types of LLM errors, here we use this term synonymously with incorrect response. What matters to the users of your RAG system is the accuracy of its answers and being able to trust them. Unlike RAG benchmarks that assess many system properties, we exclusively study: how effectively different detectors alert your RAG users when the answers are incorrect. A RAG answer might be incorrect due to problems during retrieval or generation. Our study focuses on methods to diagnose the latter issue, which stems from the fundamental unreliability of LLMs.

The Solution: Hallucination Detection Methods

Throughout our benchmark, we suppose an existing retrieval system has already fetched the context most relevant to a user’s question. We study algorithms to detect when the LLM response generated based on this context should not be trusted. Such hallucination detection algorithms are critical in high-stakes applications such as medicine, law, or finance. Beyond flagging untrustworthy responses for more careful human review, these methods can also be used to determine when it is worth executing more expensive retrieval steps (e.g. searching additional data sources, rewriting queries).

Here are the hallucination detection methods considered in our study, all of which use LLM-based techniques to score the response generated for a user query:

G-Eval (from the DeepEval package) is a method that uses chain-of-thought (CoT) prompting to automatically develop multi-step criteria for assessing the quality of a given response. In the G-Eval paper (Liu et al.), this technique was found to correlate with human judgment on several benchmark datasets. Quality can be measured in various ways specified as an LLM prompt; here we specify that it should be assessed based on the factual correctness of the response.
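
For concreteness, here is a minimal sketch of configuring G-Eval for factual correctness via DeepEval's Python API (argument names may vary across DeepEval versions; the question, answer, and context shown are placeholders):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define a G-Eval metric whose criteria focus on factual correctness.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct given the retrieval context.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    model="gpt-4o-mini",  # the judge LLM fixed across all methods in our benchmark
)

test_case = LLMTestCase(
    input="What is FY2015 net working capital for Kraft Heinz?",   # placeholder query
    actual_output="$2850.00",                                      # placeholder RAG response
    retrieval_context=["...retrieved financial statements..."],    # placeholder context
)
correctness.measure(test_case)
print(correctness.score)  # in [0, 1]; lower values suggest a less trustworthy response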

The Hallucination Metric (also from the DeepEval package) estimates the likelihood of hallucination as the degree to which the LLM response contradicts/disagrees with the context, as assessed by an LLM (in our case, GPT-4o-mini). We take the complement of this metric (1 - score) to make it consistent with our other hallucination-detection scores (where lower values indicate greater likelihood of hallucination).
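
A rough sketch of computing this score with DeepEval, including the complement we apply (again treating exact keyword arguments as version-dependent and the inputs as placeholders):

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

metric = HallucinationMetric(model="gpt-4o-mini")
test_case = LLMTestCase(
    input="Is this refund eligible?",                 # placeholder query
    actual_output="...LLM-generated answer...",       # placeholder response
    context=["...retrieved policy passage..."],       # placeholder retrieved context
)
metric.measure(test_case)
# DeepEval's metric returns higher values for more contradiction with the context,
# so we flip it: lower values now indicate greater likelihood of hallucination.
score = 1 - metric.score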

RAGAS is a RAG-specific, LLM-powered evaluation suite that provides various scores which can be used to detect hallucination. We consider the following RAGAS metrics:

  1. Faithfulness - The fraction of claims in the answer that are supported by the provided context.
  2. Answer Relevancy - The average semantic similarity between the original question and three LLM-generated questions from the answer, measured via cosine similarity between vector embeddings of each question. RAGAS employs the BAAI/bge-base-en encoder embedding model.

These scores can be used to detect cases in which the generated answer is not supported by the retrieved context or not particularly relevant to the user’s question. The likelihood that a hallucination occurred is high in either case. We additionally evaluated the Context Utilization score from RAGAS, but observed it to be ineffective for hallucination detection.
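
A minimal sketch of computing both RAGAS scores, assuming the ragas package's evaluate() interface and placeholder data:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What is FY2015 net working capital for Kraft Heinz?"],   # placeholder query
    "answer": ["$2850.00"],                                                # placeholder RAG response
    "contexts": [["...retrieved financial statements..."]],                # placeholder retrieved context
})
results = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(results)  # per-metric scores in [0, 1]; lower values suggest hallucination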

RAGAS++: After observing some undesirable outcomes from the original RAGAS technique, we developed a refined variant of it, here called RAGAS++. We used the gpt-4o-mini LLM throughout, instead of RAGAS’ defaults of gpt-3.5-turbo-16k for the generation LLM and gpt-4 for the critic LLM. We also appended a full stop character (.) to the end of each answer if not already present, observing that this reduced software failures in the RAGAS code due to its sentence parsing logic (more details below).
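
The answer preprocessing step is simple; here is a sketch of the fix we applied before passing answers to RAGAS (the helper name is our own):

def normalize_answer(answer: str) -> str:
    # Append a full stop so RAGAS' sentence parsing does not fail on terse answers.
    answer = answer.strip()
    return answer if answer.endswith(".") else answer + "."

print(normalize_answer("$2850.00"))  # -> "$2850.00."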

Self-evaluation (“Self-eval”) is a simple technique whereby the LLM is asked to evaluate the generated answer and rate its confidence on a scale of 1-5 (Likert scale). We utilize chain-of-thought (CoT) prompting to improve this technique, asking the LLM to explain its reasoning before outputting a final score. Here is the specific prompt template used:

Question: {question} 
Answer: {response} 

Evaluate how confident you are that the given Answer is a good and accurate response to the Question.
Please assign a Score using the following 5-point scale:
1: You are not confident that the Answer addresses the Question at all, the Answer may be entirely off-topic or irrelevant to the Question.
2: You have low confidence that the Answer addresses the Question, there are doubts and uncertainties about the accuracy of the Answer.
3: You have moderate confidence that the Answer addresses the Question, the Answer seems reasonably accurate and on-topic, but with room for improvement.
4: You have high confidence that the Answer addresses the Question, the Answer provides accurate information that addresses most of the Question.
5: You are extremely confident that the Answer addresses the Question, the Answer is highly accurate, relevant, and effectively addresses the Question in its entirety.
The output should strictly use the following template: Explanation: [provide a brief reasoning you used to derive the rating Score] and then write 'Score: <rating>' on the last line.
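
A minimal sketch of running this self-evaluation with the OpenAI Python client and parsing the rating into a 0-1 score (the parsing logic and the mapping onto [0, 1] are illustrative; prompt_template is the template shown above):

import re
from openai import OpenAI

client = OpenAI()

def self_eval_score(question: str, response: str, prompt_template: str) -> float:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_template.format(question=question, response=response)}],
    )
    text = completion.choices[0].message.content
    # Extract the 'Score: <rating>' line mandated by the prompt template.
    match = re.search(r"Score:\s*([1-5])", text)
    rating = int(match.group(1)) if match else 1
    return (rating - 1) / 4  # map the 1-5 Likert rating onto [0, 1]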

Trustworthy Language Model (TLM) is a model uncertainty-estimation technique that can wrap any LLM, as well as estimate the trustworthiness of responses from any other LLM. TLM scores how trustworthy each LLM response is via a combination of self-reflection, consistency across multiple sampled responses, and probabilistic measures. This approach can flag cases where the LLM made reasoning/factuality errors, where multiple contradictory yet plausible responses to a question exist, or where the prompt is atypical relative to the model’s original training data.
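
A minimal sketch of scoring an existing RAG response with TLM via the cleanlab_studio Python client (exact client and method names may differ across versions; the prompt formatting and inputs shown are placeholders):

from cleanlab_studio import Studio

studio = Studio("<your API key>")
tlm = studio.TLM()

prompt = (
    "Context: ...retrieved context...\n"    # placeholder retrieved context
    "Question: Is this refund eligible?"    # placeholder user query
)
response = "...answer generated by your RAG system's LLM..."  # placeholder response

# Returns a trustworthiness score between 0 and 1; lower values flag likely hallucinations.
score = tlm.get_trustworthiness_score(prompt, response)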

Evaluation Methodology

We compare these hallucination detection methods across four public Context-Question-Answer datasets that span different RAG applications.

For each user question in our benchmark, there is some relevant context (e.g. produced by a retrieval system), as well as a generated response (e.g. produced by an LLM based on the question and context, often along with an application-specific system prompt). Each hallucination detection method takes in the [user query, retrieved context, LLM response] and returns a score between 0 and 1, indicating the likelihood of hallucination. You can compute these scores in real time to automatically flag untrustworthy responses from your RAG system as they are generated.

To evaluate these hallucination detectors, we consider how reliably these scores take lower values when the LLM responses are incorrect vs. being correct. In each of our benchmarks, there exist ground-truth annotations regarding the correctness of each LLM response, which we solely reserve for evaluation purposes. We evaluate hallucination detectors based on AUROC, defined as the probability that their score will be lower for an example drawn from the subset where the LLM responded incorrectly than for one drawn from the subset where the LLM responded correctly. Detectors with greater AUROC values will more accurately catch RAG errors in your production system (i.e. with greater precision/recall).
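
This evaluation is straightforward to compute; here is a minimal sketch using scikit-learn with hypothetical scores and annotations:

from sklearn.metrics import roc_auc_score

# Hypothetical per-example values for one detector on one benchmark dataset:
detector_scores = [0.91, 0.12, 0.78, 0.33, 0.67]  # higher = more trustworthy response
response_is_correct = [1, 0, 1, 0, 1]             # ground-truth correctness annotations

# AUROC: probability that a correct response receives a higher score than an incorrect one.
auroc = roc_auc_score(response_is_correct, detector_scores)
print(auroc)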

All of the considered hallucination detection methods are themselves powered by an LLM. For a fair comparison, we fix this LLM to be gpt-4o-mini across all of the methods (even though, for instance, the default LLM for RAGAS is gpt-3.5-turbo-16k, which produced worse results).

Benchmark Results

We describe each benchmark dataset and the corresponding results below. These datasets stem from the HaluBench benchmark suite (we do not include the other two datasets from this suite, HaluEval and RAGTruth, as we discovered significant errors in their ground truth annotations). If a scoring method failed to run for one or more examples in any dataset (due to software failure), we indicate this by a dotted line in the graph of results.

FinanceBench

FinanceBench is a RAG benchmark based on public financial statements. Each instance in the dataset contains a large retrieved context of plaintext financial information (e.g., The Kraft Heinz Company Consolidated Balance Sheets (in millions of dollars) January 3, 2016 December 28, 2014 ASSETS Cash and cash equivalents $ 4,837 $ 2,298 Trade receivables... ), a question (e.g., What is FY2015 net working capital for Kraft Heinz? ), and a generated answer (e.g., $2850.00 ). Hallucinated responses here often contain an incorrect number.

For each hallucination scoring method, the ROC plot above shows the True-vs-False positive rate of flagging LLM outputs as erroneous at varying hallucination score thresholds. Higher curves correspond to methods which are able to detect LLM errors with greater precision/recall.

For the FinanceBench application, TLM is the most effective method for detecting hallucinations. Aside from the basic Self-Evaluation technique, most other methods struggled to provide significant improvements over random guessing, highlighting the challenges posed by this dataset, which contains large amounts of context and numerical data. RAGAS particularly struggled, often failing to internally produce the LLM statements necessary for its metric computations. We observed that RAGAS tends to be more effective when answers are complete sentences, whereas the answers to FinanceBench questions are often single numbers. The default version of RAGAS Faithfulness failed to produce any score for 83.5% of the examples, while our improved RAGAS++ variant generated a score for all of the examples (although this fix hardly increased overall performance).

Pubmed QA

Pubmed QA is a biomedical Q&A dataset based on PubMed abstracts. Each instance in the dataset contains a passage from a PubMed (medical publication) abstract, a question derived from the passage (e.g., Is a 9-month treatment sufficient in tuberculous enterocolitis?), and an LLM-generated answer.

In this application, TLM is again overall the most effective method for detecting hallucinations. Other moderately effective methods include: the DeepEval Hallucination metric, RAGAS Faithfulness, and LLM Self-Evaluation.

DROP

DROP, or “Discrete Reasoning Over Paragraphs”, is an advanced Q&A dataset based on Wikipedia articles. DROP is difficult in that its questions require reasoning over the context in the articles rather than simply extracting facts. For example, given context containing a Wikipedia passage describing touchdowns in a Seahawks vs. 49ers football game, one question is: How many touchdown runs measured 5-yards or less in total yards?, requiring the LLM to examine each touchdown run and compare its length against the 5-yard threshold.

Given the difficulty of questions in DROP, many methods were less effective at catching hallucinations in this application. TLM exhibited the best performance, followed by our improved RAGAS metrics and LLM self-evaluation.

CovidQA

CovidQA is a Q&A dataset based on scientific articles related to COVID-19. Compared to DROP, CovidQA contains simpler problems that typically require only straightforward synthesis of information from a paper to directly answer the question. For example: How much similarity the SARS-COV-2 genome sequence has with SARS-COV?

For this application, RAGAS Faithfulness performs relatively well, but still remains less effective than TLM. None of the other methods was able to detect hallucinations nearly as well in this application.

Discussion

Across all four RAG benchmarks, the Trustworthy Language Model consistently catches hallucinations with greater precision/recall than other LLM-based methods. TLM can be used to score the trustworthiness of responses from any LLM, and can be wrapped around any LLM to obtain model uncertainty estimates. Today’s lack of trustworthiness limits the ROI of enterprise AI — TLM offers an effective way to tackle this issue and achieve trustworthy RAG with comprehensive hallucination detection.

RAGAS Faithfulness proved moderately effective for catching hallucinations in applications with simple search-like queries, but not when the questions were more complex. The basic LLM Self-Evaluation technique also proved moderately effective, demonstrating how LLMs can sometimes directly catch response errors, although not reliably across many diverse responses. Other methods like G-eval and the DeepEval Hallucination metric exhibited less consistent effectiveness, indicating further refinement and adaptation of these methods is needed to use them for real-time hallucination detection in your RAG application.

RAGAS is a popular framework for assessing RAG systems, so we expected it to perform well. We encountered two surprises. First, RAGAS Answer Relevancy was mostly ineffective for detecting hallucinations, despite being a popular metric to assess the generation phase of RAG. The poor hallucination-detection performance of this metric indicates that most hallucinated responses by modern LLMs are not entirely irrelevant to the query; they just fail to accurately answer it. Second, we encountered persistent software issues while running RAGAS (such as the internal error “No statements were generated from the answer”). We overcame these issues in our RAGAS++ variant of this technique, but also report the original default behavior in our study to reflect what a developer might experience when running RAGAS in production.

The following table reports our observed failure rate (software returned an error instead of a score for a particular example) when running RAGAS Faithfulness across each dataset:

Dataset         RAGAS Null Result %    RAGAS++ Null Result %
DROP            58.90%                 0.10%
RAGTruth        0.70%                  0.00%
FinanceBench    83.50%                 0.00%
PubMedQA        0.10%                  0.00%
CovidQA         21.20%                 0.00%

Resources to learn more

  • Code to reproduce these benchmarks - Beyond reproducing the results of our study, you can also use this code to test your own hallucination detection methods.
  • Trustworthy Language Model - Learn more about TLM and how it can be used to score the trustworthiness of responses from any LLM, along with additional benchmarks of its effectiveness.
  • Quickstart tutorial - Build your own trustworthy RAG application within a few minutes.