LLMs can turn unstructured text into structured, business-ready data, but even top models still make errors. This article shows how to score the trustworthiness of Structured Outputs from any LLM in real time, to automatically catch errors and flag which LLM outputs (and fields within outputs) warrant human review. Running Structured Output benchmarks across 4 datasets and 5 LLMs, we find that our trust scores detect LLM output errors with 25% greater precision/recall than alternative scores, including LLM-as-a-judge, separate LLM-judges for each field, and token log-probabilities.


When used for data extraction or embedded within software systems, LLMs are increasingly asked to generate structured outputs (JSON/dictionary objects composed of individual named fields) rather than plain language. For instance, many teams perform labor-intensive data-extraction or document-processing tasks, where it’s easy to prototype an LLM application with OpenAI/Anthropic + Structured Outputs that seems like it could automate this work. Unfortunately, once deployed at scale, automation stalls because the LLM inevitably encounters a long tail of edge cases where its output is unreliable. By using trust scoring to flag problematic outputs and fields, these teams can focus human review on the 1-5% of cases where the LLM is untrustworthy, allowing 95-99% of their work to be accurately LLM-automated.
Thus, being able to effectively score the trustworthiness of LLM Structured Outputs can be extremely valuable. In this article, we benchmark different techniques for scoring Structured Outputs from various LLM models. We evaluate how effectively different scores detect incorrect LLM outputs, and specific incorrect fields within those outputs, while also considering cost and interpretability trade-offs.
Trust Scoring with TLM
Cleanlab’s Trustworthy Language Model (TLM) is a real-time LLM uncertainty estimator that can use any base LLM to score the trustworthiness of responses from any LLM (including black-box LLMs behind APIs like OpenAI/Anthropic). TLM can also assign a trustworthiness score to structured outputs from any LLM, and to each field within a structured output. Per-field trust scores provide more targeted oversight, pointing reviewers directly to the specific fields that warrant human attention. In document-processing applications, the documents with the lowest trust scores are likely to contain some error in their corresponding LLM structured output, while the per-field scores reveal exactly where the errors occur.
TLM works for arbitrary types of structured outputs that may include numbers, categories, lists, text, and complex nested JSON schemas. Using TLM is easy via the Python API (see our tutorial for full code).
For example, consider the following user input (provided along with a specified JSON output schema):
Extract the vendor name, invoice date, total amount, and currency from the following invoice: Payment of $1,530.00 USD was issued to Brightstone Manufacturing for the invoice dated February 12, 2024.
An LLM might generate the following structured output, containing a subtly incorrect amount and an invalid date:
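(The exact values below are an illustrative reconstruction based on the field-level discussion later in this section; the precise output will vary by model and run.)

```json
{
  "vendor_name": "Brightstone Manufacturing",
  "invoice_date": "2024-02-30",
  "total_amount": 1350.00,
  "currency": "USD"
}
```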
Here’s how you could generate that output from an OpenAI LLM:
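Below is a minimal sketch using the OpenAI Python SDK's Structured Outputs support. The client setup, schema, and field names here are illustrative assumptions rather than the exact code from our benchmark:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

invoice_text = (
    "Payment of $1,530.00 USD was issued to Brightstone Manufacturing "
    "for the invoice dated February 12, 2024."
)

# JSON schema for the desired structured output (field names are illustrative)
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor_name", "invoice_date", "total_amount", "currency"],
    "additionalProperties": False,
}

# Keep the request arguments in one dict so they can be reused for trust scoring later
openai_kwargs = dict(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": "Extract the vendor name, invoice date, total amount, and currency "
                       f"from the following invoice: {invoice_text}",
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema, "strict": True},
    },
)

response = client.chat.completions.create(**openai_kwargs)
structured_output = response.choices[0].message.content  # JSON string matching the schema
```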
You can then simply pass the original OpenAI arguments and the response obtained from OpenAI to Cleanlab’s TLM:
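Here is a minimal sketch using the cleanlab-tlm package's chat-completions scoring utility, reusing `response` and `openai_kwargs` from above. The class and method names reflect our understanding of the package at the time of writing; consult the linked tutorial for the authoritative, up-to-date usage:

```python
from cleanlab_tlm.utils.chat_completions import TLMChatCompletion

tlm = TLMChatCompletion(options={"log": ["explanation"]})  # also request explanations alongside scores

# Score the OpenAI response using the same arguments that produced it
score_result = tlm.score(response=response, **openai_kwargs)
print(score_result["trustworthiness_score"])
```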
TLM will score the OpenAI response and return an overall trustworthiness score, which is low for this example.
This low score signals that the structured output contains potential errors and should not be trusted. Cleanlab also provides a per-field breakdown along with automatically-provided explanations:
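(The breakdown below is an illustrative rendering consistent with the example above; the actual format, scores, and explanations come from TLM and will differ, see the tutorial for real output.)

```json
{
  "vendor_name":  {"score": 0.99},
  "currency":     {"score": 0.98},
  "invoice_date": {"score": 0.03, "explanation": "The extracted date is not a valid calendar date."},
  "total_amount": {"score": 0.11, "explanation": "The extracted amount does not match the $1,530.00 stated in the invoice."}
}
```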
Here, the per-field scores make the model’s mistakes immediately clear. The vendor and currency fields receive high scores because they match the text exactly. In contrast, the date receives a very low score because the model generated an impossible value, and the total amount is flagged as low-trust due to a numerical error.
To streamline human-in-the-loop review, TLM also provides a method that prints the list of fields warranting manual review (due to low trust scores), along with detailed information about each low-confidence field.
Trust Scoring with Token Log-Probabilities
An alternative way to score confidence in LLM outputs is via the average log-probability of the tokens generated by the model (i.e., logprobs/perplexity). This score quantifies how likely the output is according to the LLM that generated it. Such probabilities can fail to detect cases where the LLM does not know what it does not know (high epistemic uncertainty due to a lack of similar training data), and can be suboptimal for open-domain text fields where the same statement can be expressed in many ways. Furthermore, many frontier LLM providers such as Anthropic do not provide access to these token probabilities. On the date we ran our benchmarks, we could not get them from GPT-5 or Gemini-3-Pro, despite them being available for other OpenAI/Google models.
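Where token log-probabilities are available, computing this score is straightforward. Below is a minimal sketch using the OpenAI Python SDK, reusing the `client` and `openai_kwargs` assumed in the earlier example:

```python
import math

# Request per-token log-probabilities alongside the completion (only some models expose these)
response = client.chat.completions.create(logprobs=True, **openai_kwargs)

token_logprobs = [t.logprob for t in response.choices[0].logprobs.content]
avg_logprob = sum(token_logprobs) / len(token_logprobs)

# exp of the average log-probability is the geometric-mean token likelihood (i.e., 1 / perplexity)
confidence = math.exp(avg_logprob)
print(confidence)
```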
Trust Scoring with LLM-as-a-judge
LLM-as-a-judge is another popular approach to score LLM outputs, via an extra LLM request that explicitly evaluates the original LLM’s output. Like token probabilities, this approach can be easily applied to structured outputs. Here we specifically consider the well-known LLM-as-Judge method from Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena: the LLM-judge takes in the user request (containing all relevant information, including the desired structured-output schema) and the model’s structured-output response to that request, and produces an overall correctness score.
While scoring the overall LLM structured output is straightforward, there are different ways to produce per-field scores using LLM-as-a-judge. Our benchmarks study the following two approaches.
Single Call: This approach uses one LLM-judge call to simultaneously score all fields in the structured output, where the LLM-judge itself generates a JSON structured output containing one score per field (see the sketch after this list).
Multiple Calls: This approach makes a separate LLM-judge call to assess each field, producing one score per call. This lets the LLM-judge focus on a single field at a time, leading to more precise and targeted assessments, but it is slower and costlier than a single-call LLM-as-a-judge. Some of our enterprise customers also faced token-rate-limit problems when attempting this approach in financial document-processing applications involving 50+ output fields.
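For concreteness, here is a minimal sketch of the single-call per-field judge using the OpenAI Python SDK. The prompt wording, function name, and judge model are illustrative choices, not the exact setup used in our benchmark:

```python
import json

from openai import OpenAI

JUDGE_INSTRUCTIONS = (
    "You are evaluating a model's structured-output response to a user request. "
    "For each field in the response, rate how likely its value is to be correct, on a scale from 0 to 1. "
    "Return a JSON object mapping each field name to its score."
)

def judge_per_field_single_call(client: OpenAI, user_request: str, llm_output: dict,
                                judge_model: str = "gpt-4.1-mini") -> dict:
    """One LLM-judge call that scores every field of the structured output at once."""
    judge_response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {
                "role": "user",
                "content": f"User request:\n{user_request}\n\nModel response:\n{json.dumps(llm_output)}",
            },
        ],
        response_format={"type": "json_object"},  # the judge itself replies with JSON scores
    )
    return json.loads(judge_response.choices[0].message.content)
```

The multiple-call variant simply loops the same idea over each field, asking the judge about one field per request.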
Benchmarking Scoring Methods
We run the Structured Outputs benchmark, which is composed of 4 datasets from diverse LLM Structured Outputs applications. While other public structured-outputs datasets have been found to be riddled with mistakes, this benchmark suite offers high-quality ground truth for reliable assessment.
To compare different scoring techniques for LLM Structured Outputs, we first generated such outputs for each dataset using several different models: GPT-4.1-mini, GPT-5, Gemini-2.5-Flash, Gemini-2.5-Pro, and Gemini-3-Pro. For results where the output is generated by GPT-5 or Gemini-3-Pro, token-probability-based scoring is omitted because these models did not expose token-level log probabilities at the time we ran them.
While the LLM outputs are generated using various models, all scoring of these outputs is performed using TLM powered by GPT-4.1-mini as its base LLM. For consistency, all LLM-as-Judge evaluations also use GPT-4.1-mini to power the LLM-judge.
We evaluate score-based detectors by their Area Under the Receiver Operating Characteristic Curve (AUROC), which measures how reliably their scores separate incorrect outputs from correct ones (i.e., how consistently incorrect outputs receive lower scores). Higher AUROC means a detector both catches more incorrect outputs (higher recall) and is more accurate when flagging them (higher precision).
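As a sketch of the computation (variable names here are hypothetical), the AUROC of any scoring method can be computed with scikit-learn by treating "incorrect output" as the positive class and the negated trust score as the detection score:

```python
from sklearn.metrics import roc_auc_score

# is_incorrect[i] = 1 if the i-th LLM output (or field) is actually wrong, else 0
# trust_scores[i] = the scoring method's value for that output (higher = more trustworthy)
detection_scores = [-s for s in trust_scores]  # flip sign so higher = more likely incorrect
auroc = roc_auc_score(is_incorrect, detection_scores)
```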
Results
The graphs at the top of this article present our results, evaluating both scores for each overall LLM output (one per input document to process) and scores for each individual output field.
Across all five models considered in our benchmark, Cleanlab’s per-document trust scores achieve the highest AUROC on every benchmark, by a substantial margin. This detector consistently exhibits the best precision/recall for catching incorrect structured outputs. LLM-as-Judge delivers only moderate performance, while scoring based on the average token log probability is inconsistently effective, varying widely across models.
For per-field scoring to detect which specific output fields are erroneous, the multiple-call LLM-as-Judge method consistently outperforms the single-call version (but at the cost of making many more LLM calls). TLM consistently outperforms both LLM-as-Judge baselines while consuming fewer tokens and less compute than the multiple-call approach. This efficiency advantage grows even more significant as the number of fields increases, since methods that require a separate model call per field scale poorly for large JSON outputs.
Discussion
As enterprise LLM applications increasingly rely on structured outputs, effective trust assessment is more important than ever. Traditional methods like perplexity and LLM-as-Judge scores are suboptimal for the nuanced, field-by-field correctness required in these workflows. Cleanlab provides a scalable, real-time solution: trust scores that accurately flag incorrect outputs, pinpointing which fields may be wrong (via per-field scores) and why (via automated explanations).
Easily score the trustworthiness of your own LLM Structured Outputs via our tutorial.
Appendix
Here we report additional metrics that evaluate other facets of structured output scoring techniques.
Example LLM Errors Found using Cleanlab Trust Scoring
Here we showcase example errors that LLMs made in this benchmark. All examples in this section are from structured outputs generated by the GPT-4.1-mini LLM.

