V_1: Unifying Generation and Self-Verification for Parallel Reasoners
Abstract
Unifying generation and verification in a single framework built on pairwise self-verification enhances test-time scaling for complex reasoning, improving performance on code-generation and mathematical-reasoning benchmarks.
Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger at pairwise self-verification. Leveraging this insight, we introduce V_1, a framework that unifies generation and verification through efficient pairwise ranking. V_1 comprises two components: V_1-Infer, an uncertainty-guided algorithm using a tournament-based ranking that dynamically allocates self-verification compute to candidate pairs whose relative correctness is most uncertain; and V_1-PairRL, an RL framework that jointly trains a single model as both generator and pairwise self-verifier, ensuring the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, V_1-PairRL achieves 7--9% test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7% over standard RL in a code-generation setting.
Community
Can LLMs self-verify their own solutions?
Modern LLMs are increasingly used as parallel reasoners, where multiple candidate solutions are sampled and then filtered or aggregated. In this setting, the key bottleneck is often not generation but verification—identifying which candidate is actually correct.
In this work, we show that pairwise self-verification is a surprisingly strong primitive. Instead of scoring candidates independently, models compare solutions against each other, which provides a much stronger verification signal.
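As a minimal sketch of the pairwise primitive (not the paper's prompt or implementation), a pairwise judgment can be made more robust to the position bias of LLM judges by querying both orderings and only trusting a consistent verdict; `judge` here is a hypothetical stand-in for the model's self-verification call:

```python
def pairwise_verify(a, b, judge):
    """Compare two candidate solutions with a pairwise judge.

    `judge(x, y)` returns True if it prefers `x` over `y`. LLM judges are
    known to exhibit position bias, so we query both presentation orders
    and only trust a verdict when the two calls agree; otherwise we
    report "uncertain" (None). Hypothetical interface, for illustration.
    """
    first = judge(a, b)        # a presented first: does a win?
    second = not judge(b, a)   # b presented first, inverted to "does a win?"
    if first == second:
        return first           # consistent verdict: True iff a wins
    return None                # inconsistent verdicts: treat as uncertain
```

With a toy length-based judge, a clear win in both orders yields a verdict, while an order-dependent outcome is flagged as uncertain rather than forced.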
We introduce V_1, a framework that unifies generation and self-verification:
• V_1-Infer: an efficient Swiss-system-style tournament that ranks candidate solutions through pairwise comparisons, concentrating compute on pairs whose relative correctness is most uncertain.
• V_1-PairRL: a reinforcement learning framework that co-trains a single model as both a generator and a pairwise self-verifier, so that generation and verification improve together.
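The Swiss-system idea can be sketched as follows. This is an illustrative toy, not the paper's algorithm: `compare` is a hypothetical stand-in for the model's pairwise self-verification call, and the round count is arbitrary. Pairing candidates with similar running scores approximates the "focus on uncertain pairs" heuristic, since same-score candidates are the ones whose relative order is least resolved:

```python
import random

def swiss_tournament_rank(candidates, compare, rounds=3, seed=0):
    """Rank candidates via Swiss-style pairwise comparisons.

    `compare(a, b)` returns True if `a` is judged better than `b`
    (in V_1-Infer this would be a model self-verification call; here
    it is an abstract callable). Each round, candidates with similar
    scores are paired, so comparisons concentrate where the relative
    ordering is most uncertain.
    """
    rng = random.Random(seed)
    scores = {c: 0 for c in candidates}
    for _ in range(rounds):
        # Sort by score (random tiebreak), then pair adjacent entries.
        order = sorted(candidates, key=lambda c: (-scores[c], rng.random()))
        for i in range(0, len(order) - 1, 2):  # odd N: last gets a bye
            a, b = order[i], order[i + 1]
            if compare(a, b):
                scores[a] += 1
            else:
                scores[b] += 1
    return sorted(candidates, key=lambda c: -scores[c])
```

With O(N) comparisons per round, a few rounds suffice to surface the strongest candidate, versus O(N^2) for a full round-robin.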
Across code and math benchmarks we observe substantial test-time scaling improvements (up to +19.6% Pass@1 on code and +17.9% on math). Pairwise verification also composes well with aggregation-based methods such as Recursive Self-Aggregation, improving convergence and reducing latency.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis (2026)
- Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification (2026)
- Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure (2026)
- JudgeRLVR: Judge First, Generate Second for Efficient Reasoning (2026)
- ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models (2026)
- Learning to Self-Verify Makes Language Models Better Reasoners (2026)
- Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for Large Language Models (2026)