Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics
Abstract
Token-level hallucination detection is reformulated as a quickest change detection problem, revealing fundamental limits on detection delay and demonstrating superior performance through causal recurrent modeling.
Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
Community
Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
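For readers new to the framing, here is a minimal sketch of a CUSUM monitor over per-token scores. It is not the paper's learned detector: the score values, the threshold, and the helper names `cusum_alarm` and `detection_delay` are hypothetical, and counting delay as alarm index minus onset index is an assumed convention.

```python
from typing import Optional, Sequence


def cusum_alarm(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first token index where the CUSUM statistic reaches threshold.

    The statistic follows the classical recursion S_t = max(0, S_{t-1} + s_t),
    where s_t is a per-token increment (ideally a log-likelihood ratio).
    Returns None if no alarm is raised.
    """
    stat = 0.0
    for t, increment in enumerate(scores):
        stat = max(0.0, stat + increment)
        if stat >= threshold:
            return t
    return None


def detection_delay(scores: Sequence[float], onset: int, threshold: float) -> Optional[int]:
    """Alarm index minus onset index (assumed convention).

    None means the onset was missed; a negative value means the alarm fired
    before the onset, i.e. a false alarm.
    """
    alarm = cusum_alarm(scores, threshold)
    return None if alarm is None else alarm - onset


# Toy stream: five faithful tokens (negative increments), then five
# hallucinated tokens (positive increments). Values are illustrative only.
toy_scores = [-0.5] * 5 + [1.0] * 5
toy_delay = detection_delay(toy_scores, onset=5, threshold=2.5)
```

On this toy stream the statistic stays at zero until the onset and then needs three positive increments to cross the threshold, so the alarm fires two tokens after the onset. Averaging the delay only over caught onsets, as opposed to also accounting for the `None` cases, is the distinction the abstract draws between the 11-13 token figure and the recall-honest delay.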
This is a really interesting take on hallucination. Most papers just treat this as a static classification task, but framing it as a quickest change detection problem—like spotting a sensor shift—actually makes a lot of sense for streaming output.
I’m curious about that gap between the theoretical lower bound of 1.3 tokens and the 11-13 tokens the model achieves. Do you think that 1/4.5 divergence efficiency is something we can ever bridge with better training?
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/89f4a384-3eff-4449-8bac-1f99f307b3e5
Thanks, Noah — this is exactly the question that kept me up, and I'm glad the QCD framing landed.
Short version: better training on the same features won't close it; better features will.
Here's how we read the gap. The 1.3-token floor is set by the KL divergence the features carry between the faithful and hallucinated regimes (≈3.5 nats). What a detector actually achieves is governed by the realized information rate of its score, which we can measure directly, and the learned score recovers only about 1/4.5 of that divergence. That 4.5× is the multiplicative delay penalty. Two things about it:
- It's a property of the score's shape, not its scale. I checked: it's invariant to recalibration and barely moves under monotone reshaping. So "train longer / calibrate better / add depth" doesn't touch it. In our information-rate theorem the realized rate equals the KL only when the score is affine in the true log-likelihood ratio; ours isn't, and the deficit is close to irreducible for these features.
- The rest of the gap, from ≈6 tokens (1.3 × 4.5) to the 11–13 we observe, is a finite-horizon effect: the increments are strongly correlated and detection fires faster than the score mixes, so the asymptotic rate overshoots.
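To make the first point concrete, here is a small sketch built on the classical Donsker-Varadhan bound, E_P[f] - log E_Q[exp(f)] <= KL(P || Q), with equality when f is the log-likelihood ratio up to an additive constant. The paper's exact definition of the realized rate is not given in this thread, so this is a stand-in rather than the theorem itself; in particular, the plain bound below is shift-invariant but not scale-invariant. The three-symbol distributions and the clipped score are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dv_value(score, p, q):
    """Donsker-Varadhan objective E_p[score] - log E_q[exp(score)]."""
    mean_under_p = sum(pi * s for pi, s in zip(p, score))
    log_mgf_under_q = math.log(sum(qi * math.exp(s) for qi, s in zip(q, score)))
    return mean_under_p - log_mgf_under_q


# Toy feature distributions (hypothetical): p for the hallucinated regime,
# q for the faithful regime.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]

llr = [math.log(pi / qi) for pi, qi in zip(p, q)]   # true log-likelihood ratio
shifted = [x + 3.0 for x in llr]                    # same shape, different offset
clipped = [max(-1.0, min(1.0, x)) for x in llr]     # monotone but wrong shape

kl = kl_divergence(p, q)
rate_llr = dv_value(llr, p, q)          # attains the KL
rate_shifted = dv_value(shifted, p, q)  # still attains the KL
rate_clipped = dv_value(clipped, p, q)  # strictly below the KL
```

The clipped score ranks the symbols in the same order as the true log-likelihood ratio yet realizes less than the full divergence, which is the sense in which a deficit can come from the shape of a score and not from its offset.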
So the lever isn't a bigger model, it's features that separate the two regimes more sharply (push those 3.5 nats up). The label oracle sits at ≈1.0 token (4.6 nats) — roughly where perfect separation would land. That's the ceiling worth chasing.
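The numbers quoted above can be checked with a few lines of arithmetic. This sketch assumes the delay bound takes its first-order asymptotic form, log(1/alpha) divided by the information rate in nats. The thread does not spell out the formula, but the quoted figures are consistent with it. The function name `asymptotic_delay` is hypothetical.

```python
import math


def asymptotic_delay(false_alarm_rate: float, info_rate_nats: float) -> float:
    """First-order delay estimate log(1/alpha) / rate (assumed form)."""
    return math.log(1.0 / false_alarm_rate) / info_rate_nats


alpha = 0.01            # false-alarm rate quoted in the abstract
feature_kl = 3.5        # nats carried by the features (from the reply above)
oracle_kl = 4.6         # nats for the label oracle (from the reply above)
efficiency = 1 / 4.5    # fraction of the divergence the learned score realizes

delay_floor = asymptotic_delay(alpha, feature_kl)                  # about 1.3 tokens
delay_oracle = asymptotic_delay(alpha, oracle_kl)                  # about 1.0 token
delay_learned = asymptotic_delay(alpha, feature_kl * efficiency)   # about 6 tokens
```

Under this assumption the 4.5× efficiency loss scales the floor directly, which gives the ≈6 tokens mentioned above; the remaining distance to the observed 11–13 tokens is what the reply attributes to finite-horizon effects, and this formula does not capture it.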
One honest caveat I keep in the paper: at this false-alarm budget every detector still catches under a third of onsets at the first token, so the recall-honest delay is much larger. Closing the 4.5× speeds up the onsets you do catch; it doesn't fix the miss rate, which is a separate axis.
And thanks for making the ResearchPod episode — that's a generous thing to do. I'll give it a listen.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Hallucination Detection via Activations of Open-Weight Proxy Analyzers (2026)
- Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry (2026)
- PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts (2026)
- TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection (2026)
- TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (2026)
- Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics (2026)
- BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in Large Language Models (2026)