Papers
arxiv:2604.02288

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Published on Apr 2 · Submitted by Dan Zhang on Apr 7
Abstract

Sample-Routed Policy Optimization (SRPO) combines the strengths of GRPO and SDPO by routing correct samples to reward-aligned reinforcement and failed samples to targeted logit-level correction, improving both stability and performance in reinforcement learning with verifiable rewards.

AI-generated summary

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.

Community

Paper submitter

Hello everyone, and welcome! Please check out our work: Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing


Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

GRPO applies a uniform penalty across failures (a coarse signal), while SDPO provides dense logit-level guidance but suffers from target collapse later in training. SRPO (Sample-Routed Policy Optimization) unifies both by routing correct samples to GRPO for reward-aligned reinforcement and failed samples to SDPO for targeted logit-level correction, with entropy-aware dynamic weighting to suppress unreliable high-entropy targets.

Key Idea

The central observation is that GRPO and SDPO have complementary failure modes: GRPO’s reward signal is too coarse to guide recovery from errors, while SDPO’s self-distillation targets become unreliable as entropy grows during training. SRPO resolves this by splitting each training batch — correct samples follow the GRPO path and incorrect samples follow the SDPO path — so each optimization objective is applied only where it is most effective.
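The batch split described above can be sketched in a few lines. This is a minimal illustration assuming a binary verifiable reward per rollout; the function and field names are hypothetical, not taken from the paper's released code.

```python
def route_batch(samples):
    """Split rollouts by verifiable reward:
    correct -> GRPO (reward-aligned reinforcement),
    failed  -> SDPO (logit-level self-distillation)."""
    grpo_batch = [s for s in samples if s["reward"] == 1]
    sdpo_batch = [s for s in samples if s["reward"] == 0]
    return grpo_batch, sdpo_batch

# Toy batch: each rollout carries a 0/1 verifiable reward.
rollouts = [
    {"id": 0, "reward": 1},  # verified correct
    {"id": 1, "reward": 0},  # failed
    {"id": 2, "reward": 1},  # verified correct
]
grpo_batch, sdpo_batch = route_batch(rollouts)
```

Each sub-batch is then optimized with only the objective that is effective for it, rather than applying a single loss to the whole batch.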

(Figure: GRPO vs. SDPO)

Method / Approach

At each training step, SRPO classifies generated samples as correct or incorrect based on reward feedback and routes them to the appropriate loss. Correct samples receive the GRPO objective to reinforce successful reasoning trajectories. Incorrect samples receive the SDPO objective, which provides dense logit-level supervision to steer the model toward better solutions. An entropy-aware dynamic weighting mechanism monitors the self-distillation target’s reliability and down-weights high-entropy (unreliable) targets to prevent collapse.
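The entropy-aware weighting can be illustrated with a small sketch: compute the entropy of the self-teacher's token distribution and scale the distillation loss down as entropy grows. The linear schedule below is an illustrative assumption, not the paper's exact weighting function.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token-level distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weight(probs, max_entropy):
    """Down-weight high-entropy (unreliable) self-teacher targets.
    Weight is 1 for a fully confident target and decays linearly
    to 0 at maximum entropy (illustrative choice)."""
    return max(0.0, 1.0 - entropy(probs) / max_entropy)

V = 4                 # toy vocabulary size
h_max = math.log(V)   # maximum possible entropy over V tokens

confident = [0.97, 0.01, 0.01, 0.01]  # peaked teacher -> weight near 1
uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform teacher -> weight 0
```

A near-uniform teacher distribution hits maximum entropy and contributes nothing to the distillation loss, which is how collapse from unreliable targets is suppressed.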

(Figure: sample routing)

(Figure: entropy-aware weighting)

Results

SRPO achieves the fast early improvement characteristic of SDPO combined with the long-horizon stability of GRPO. Evaluated across five reasoning benchmarks and two model scales, SRPO consistently outperforms both GRPO and SDPO used in isolation, demonstrating that intelligent sample routing is a simple but effective way to get the best of both worlds.


Get this paper in your agent:

hf papers read 2604.02288
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
