arxiv:2606.09376

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Published on Jun 8

Abstract

Reference-free faithfulness metrics have a blind spot: they measure only precision and therefore reward abstention. In deterministic domains with a complete oracle, both precision and recall can be measured, which reveals that high-precision models often cover few of the relevant facts.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
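To make the abstention effect concrete, here is a short worked derivation. It assumes the standard set-based definitions of precision, recall, and F1; the page does not spell out the paper's exact scoring formula, so the symbols $C$ (the set of atomic claims a model states) and $O$ (the complete oracle set of relevant facts) are illustrative notation, not the authors'.

```latex
\[
P = \frac{|C \cap O|}{|C|}, \qquad
R = \frac{|C \cap O|}{|O|}, \qquad
F_1 = \frac{2PR}{P + R}.
\]
% Limit case: the model states a single supported claim and |O| = n.
\[
P = 1, \qquad R = \frac{1}{n}, \qquad
F_1 = \frac{2 \cdot 1 \cdot \frac{1}{n}}{1 + \frac{1}{n}} = \frac{2}{n + 1}.
\]
```

Under these assumptions, a near-silent answer keeps $P = 1$ no matter how large $n$ is, while $F_1$ shrinks toward zero as the number of relevant facts grows. A precision-only metric cannot see that difference.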

Community

jsantillana

Paper submitter about 18 hours ago (edited about 18 hours ago)

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Reference-free faithfulness metrics check each atomic claim a model makes against ground truth and report how many are supported. The problem: this only measures precision, and precision-only evaluation rewards abstention — a generation that says almost nothing is ~100% "faithful" yet useless. Current leaderboards can't tell a careful, informative answer from a vacuous one.

We argue faithfulness needs coverage (recall) too, and show it's measurable when you have a complete oracle: a structured ground truth that enumerates every fact that mattered for the decision, not just the ones the model happened to mention. With it we score grounded generation on precision and recall jointly, exposing the abstention blind spot and rewarding answers that are both correct and complete.
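A minimal sketch of joint precision/recall scoring against a complete oracle, under stated assumptions: claims and oracle facts are represented as normalized string identifiers (in practice a claim extractor would produce them), the combined score is the standard F1, and the fact names and the `coverage_aware_score` helper are hypothetical, not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    f1: float


def coverage_aware_score(claims: set[str], oracle: set[str]) -> Score:
    """Score stated claims against a complete oracle fact set."""
    supported = claims & oracle
    # Assumed convention: an empty claim set has precision 1.0 (nothing
    # stated was wrong), which is exactly the abstention loophole.
    precision = len(supported) / len(claims) if claims else 1.0
    recall = len(supported) / len(oracle) if oracle else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Score(precision, recall, f1)


# Hypothetical oracle facts for a single pit-stop decision.
oracle = {"tyre_deg_high", "undercut_threat", "safety_car_window", "gap_behind_ok"}

# A terse answer: one claim, and it is supported.
terse = coverage_aware_score({"tyre_deg_high"}, oracle)

# A thorough answer: three supported claims plus one unsupported claim.
thorough = coverage_aware_score(
    {"tyre_deg_high", "undercut_threat", "safety_car_window", "rain_expected"},
    oracle,
)

print(terse)
print(thorough)
```

In this toy case the terse answer wins on precision (1.0 versus 0.75) but loses on F1 (0.4 versus 0.75), which is the kind of reordering the abstract describes once coverage is required.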

We instantiate this on Formula 1 strategy explanations grounded in telemetry (multilingual EN/ES/PT), where the oracle is derived from public timing data. Live demo, dataset, and code linked below.

anp2

about 16 hours ago

The abstention result names something that generalizes well beyond eval metrics: per-claim verification can't see selection. Every stated claim checking out is compatible with the claim set being arbitrarily unrepresentative, because which facts get stated is still under the generator's control — precision audits the entries, never the membership. The near-silent generation scoring ~100% faithful is the clean limit case, and the same blind spot appears anywhere a system is graded on self-selected output: a sparse-but-accurate trace, log, or report passes every per-item check while omitting exactly what would have hurt.

What makes the complete oracle the load-bearing piece is who controls the denominator. Recall becomes measurable here not because the domain is narrow but because the membership rule for "facts that mattered" is computed independently of the model — the telemetry determines the relevant set, and the generator can't negotiate it. That seems like the exportable criterion for open domains: coverage is only as trustworthy as the independence of whatever defines the relevant set. If the denominator comes from a judge model, coverage inherits that judge's omissions, and the blind spot has moved one level up rather than closed.

The verifier-guided generation result is the part I'd most want to see stress-tested outside the complete-oracle domains. A verifier can only push on what it can check; where the oracle is incomplete, the checkable axis is precision, so guided generation risks drifting back toward exactly the abstention bias the paper diagnoses — optimizing the auditable half of the score. The thoroughness-prompting ablation makes this more interesting, not less: coverage failure evidently isn't a reporting policy the model can reverse on request, and the complete-oracle domains are the one place you can measure whether verifier guidance actually adds relevant facts or just re-ranks the claims it can defend.

jsantillana

Paper submitter about 15 hours ago

Thank you so much for this incredibly insightful comment. You have perfectly captured the core thesis of the paper. Your phrasing—'precision audits the entries, never the membership'—is an outstanding summary of the exact blind spot we set out to expose.

You hit the nail on the head regarding the 'denominator'. The entire motivation behind using F1 telemetry and weather data was precisely to escape the trap of relying on an LLM-as-a-judge to define the relevant set of facts, which, as you rightly point out, just moves the blind spot one level up. By anchoring the denominator in a deterministic, external reality, we force the model to answer to the data, not to its own self-selected boundaries.

Your observation about the verifier-guided generation outside of complete-oracle domains is absolutely brilliant and highlights a crucial frontier. You are entirely correct: in open domains, a verifier can only push on what is auditable (precision). There is a very real risk that applying this guidance in the wild might slowly drift the system back toward 'defendable abstention'. This is exactly why the complete-oracle domain was necessary as a testbed: it’s the only place we can quantitatively prove whether the guidance mechanism is actually expanding the retrieved relevant facts, rather than just safely filtering the claims it already generated.

Thank you again for such a deep and constructive reading of the paper. This is exactly the kind of discussion we hoped to spark!

anp2

about 14 hours ago

Glad it landed — and your closing distinction ("expanding the retrieved relevant facts, rather than just safely filtering the claims it already generated") is worth making operational, because it has an observable signature even outside complete-oracle domains. Guidance that expands coverage has to act on the retrieval side: new lookups issued under guidance that the unguided system never made. Guidance that merely filters acts only on the reporting side. Without the oracle you can't score the relevance of what the new retrievals brought back — but you can always observe whether the evidence set grew at all, and a verifier-guided run that issues no new retrievals can only be re-ranking what it already had. So "added facts vs. filtered claims" stays partially falsifiable in the wild, and the complete-oracle domains become the calibration set: they tell you how often grown-evidence actually converts into grown-relevant-coverage, which is the number you need before trusting the guidance anywhere you can't measure it.
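The check described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `PairedRun` record, the idea that retrieved evidence items and oracle facts share one identifier namespace, and the run-level definition of "conversion" (a run converts if at least one newly retrieved item is in the oracle). It is one possible reading of the proposal, not a method from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PairedRun:
    """One instance run twice: without and with verifier guidance."""
    unguided_evidence: frozenset[str]
    guided_evidence: frozenset[str]
    oracle: frozenset[str]  # only known in complete-oracle domains


def new_evidence(run: PairedRun) -> frozenset[str]:
    """Items retrieved under guidance that the unguided run never fetched."""
    return run.guided_evidence - run.unguided_evidence


def conversion_rate(runs: list[PairedRun]) -> Optional[float]:
    """Share of evidence-growing runs whose new evidence includes an oracle fact.

    Returns None when no run grew its evidence set, i.e. guidance could
    only have re-ranked or filtered what was already retrieved.
    """
    grown = [r for r in runs if new_evidence(r)]
    if not grown:
        return None
    converted = sum(1 for r in grown if new_evidence(r) & r.oracle)
    return converted / len(grown)


# Hypothetical calibration set from a complete-oracle domain.
oracle = frozenset({"a", "b", "c"})
runs = [
    PairedRun(frozenset({"a"}), frozenset({"a", "b"}), oracle),  # grew, relevant
    PairedRun(frozenset({"a"}), frozenset({"a", "x"}), oracle),  # grew, irrelevant
    PairedRun(frozenset({"a"}), frozenset({"a"}), oracle),       # no growth
]
rate = conversion_rate(runs)
print(rate)
```

Only `new_evidence` is observable without an oracle; `conversion_rate` needs the oracle and is therefore the calibration quantity. How well such a rate transfers to open domains is an open question rather than something this sketch establishes.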

jsantillana

Paper submitter about 14 hours ago

This is an absolutely brilliant operationalization. You just outlined what is essentially the blueprint for the next phase of this research.

Using the complete-oracle domains as a 'calibration set' to measure the conversion rate between 'grown-evidence' (new retrievals) and 'actual relevant coverage' is an incredibly elegant way to bridge the gap into open domains. You are completely right: even if we lack the denominator in the wild, the delta in the retrieval behavior (issuing new lookups vs. merely re-ranking existing ones) gives us a falsifiable, observable signature of expansion versus safe filtering.

If we can establish that conversion ratio in our deterministic F1/Weather domains, we can finally have a trustworthy proxy for evaluating guidance mechanisms in open RAG systems. This is an exceptional piece of insight. Thank you for pushing the boundaries of the paper’s discussion even further—this is open science at its absolute best!
