Papers
arxiv:2604.02288

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Published on Apr 2 · Submitted by Dan Zhang on Apr 7
Abstract

Sample-Routed Policy Optimization (SRPO) combines the strengths of GRPO and SDPO by routing correct samples to reward-aligned reinforcement and failed samples to targeted logit-level correction, improving both stability and performance in reinforcement learning with verifiable rewards.

AI-generated summary

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.

Community

Paper submitter

Hello everyone, and welcome! Please check out our work: Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing


Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

GRPO applies a uniform penalty across failures (a coarse signal), while SDPO provides dense logit-level guidance but suffers from target collapse later in training. SRPO (Sample-Routed Policy Optimization) unifies both by routing correct samples to GRPO for reward-aligned reinforcement and failed samples to SDPO for targeted logit-level correction, with entropy-aware dynamic weighting to suppress unreliable high-entropy targets.

Key Idea

The central observation is that GRPO and SDPO have complementary failure modes: GRPO’s reward signal is too coarse to guide recovery from errors, while SDPO’s self-distillation targets become unreliable as entropy grows during training. SRPO resolves this by splitting each training batch — correct samples follow the GRPO path and incorrect samples follow the SDPO path — so each optimization objective is applied only where it is most effective.
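The batch split described above can be sketched in a few lines. This is a minimal illustration assuming a binary verifiable reward per rollout; the function and field names are hypothetical, not taken from the paper's released code.

```python
def route_batch(samples):
    """Split rollouts by verifiable reward:
    correct -> GRPO (reward-aligned reinforcement),
    failed  -> SDPO (logit-level self-distillation)."""
    grpo_batch = [s for s in samples if s["reward"] == 1]
    sdpo_batch = [s for s in samples if s["reward"] == 0]
    return grpo_batch, sdpo_batch

# Toy batch: each rollout carries a 0/1 verifiable reward.
rollouts = [
    {"id": 0, "reward": 1},  # verified correct
    {"id": 1, "reward": 0},  # failed
    {"id": 2, "reward": 1},  # verified correct
]
grpo_batch, sdpo_batch = route_batch(rollouts)
```

Each sub-batch is then optimized with only the objective that is effective for it, rather than applying a single loss to the whole batch.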

(Figure: GRPO vs. SDPO)

Method / Approach

At each training step, SRPO classifies generated samples as correct or incorrect based on reward feedback and routes them to the appropriate loss. Correct samples receive the GRPO objective to reinforce successful reasoning trajectories. Incorrect samples receive the SDPO objective, which provides dense logit-level supervision to steer the model toward better solutions. An entropy-aware dynamic weighting mechanism monitors the self-distillation target’s reliability and down-weights high-entropy (unreliable) targets to prevent collapse.
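The entropy-aware weighting can be illustrated with a small sketch: compute the entropy of the self-teacher's token distribution and scale the distillation loss down as entropy grows. The linear schedule below is an illustrative assumption, not the paper's exact weighting function.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token-level distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weight(probs, max_entropy):
    """Down-weight high-entropy (unreliable) self-teacher targets.
    Weight is 1 for a fully confident target and decays linearly
    to 0 at maximum entropy (illustrative choice)."""
    return max(0.0, 1.0 - entropy(probs) / max_entropy)

V = 4                 # toy vocabulary size
h_max = math.log(V)   # maximum possible entropy over V tokens

confident = [0.97, 0.01, 0.01, 0.01]  # peaked teacher -> weight near 1
uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform teacher -> weight 0
```

A near-uniform teacher distribution hits maximum entropy and contributes nothing to the distillation loss, which is how collapse from unreliable targets is suppressed.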

(Figure: sample routing)

(Figure: entropy-aware weighting)

Results

SRPO achieves the fast early improvement characteristic of SDPO combined with the long-horizon stability of GRPO. Evaluated across five reasoning benchmarks and two model scales, SRPO consistently outperforms both GRPO and SDPO used in isolation, demonstrating that intelligent sample routing is a simple but effective way to get the best of both worlds.


Get this paper in your agent:

hf papers read 2604.02288
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
