arxiv:2605.30789

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Published on Jun 2 · Submitted by qishisuren on Jun 15

Abstract

The Small-to-Large Policy Optimization (S2L-PO) framework uses smaller models as natural explorers to increase policy-level rollout diversity and improve the training efficiency of large language models.

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
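
The pass@k comparison the abstract relies on can be illustrated with the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below is a minimal illustration, not the authors' evaluation code: the function names and the per-problem correct counts are hypothetical, chosen only to show how a model with lower pass@1 can still reach higher pass@k as k grows.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts (NOT from the paper): correct samples out of n = 16
# for four problems. The "small" model solves every problem occasionally;
# the "large" model solves two problems reliably and two never.
N_SAMPLES = 16
small_counts = [2, 1, 1, 3]
large_counts = [12, 0, 0, 14]

for k in (1, 4, 16):
    small = mean_pass_at_k(small_counts, N_SAMPLES, k)
    large = mean_pass_at_k(large_counts, N_SAMPLES, k)
    print(f"k={k:2d}  small={small:.3f}  large={large:.3f}")
```

With these made-up counts the large model leads at k = 1 while the small model leads at k = 16, which is the kind of crossover the abstract describes as evidence of higher policy-level diversity.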

Community

Paper submitter

We recently open-sourced S2L-PO (Small-to-Large Policy Optimization). S2L-PO addresses the problem of rollout diversity in reinforcement learning for large-model reasoning. Its core idea is to use the inherent diversity of small models as "explorers" that provide richer and more stable policy-level exploration signals for training the large model, then gradually transition back to the large model's own sampling through progressive annealing. Compared with simply increasing token-level randomness, this approach better preserves the coherence of reasoning trajectories while improving both training efficiency and final performance. We have validated the method on multiple mathematical reasoning benchmarks and have made the relevant resources publicly available.
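
The transition from small-model rollouts to the large model's own sampling can be sketched as a per-step mixing schedule combined with GRPO's group-relative advantage. This is a minimal sketch under stated assumptions: the linear decay, the names `small_model_fraction`, `split_group`, and `anneal_steps`, and the use of the population standard deviation are all illustrative choices, not details confirmed by the page. Any off-policy correction for the small-model rollouts is omitted.

```python
import statistics


def small_model_fraction(step: int, anneal_steps: int) -> float:
    """Hypothetical linear schedule: 1.0 at step 0, 0.0 from anneal_steps on."""
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def split_group(group_size: int, step: int, anneal_steps: int):
    """Return (n_small, n_large): rollouts per prompt drawn from each source."""
    n_small = round(group_size * small_model_fraction(step, anneal_steps))
    return n_small, group_size - n_small


def group_relative_advantages(rewards, eps: float = 1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# Early in training the group is filled by the fixed small explorer,
# later by the large learner itself.
for step in (0, 50, 100):
    print(step, split_group(group_size=8, step=step, anneal_steps=100))

# Rewards for one mixed group (made-up values: 1 = correct, 0 = incorrect).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

A group whose rollouts all receive the same reward yields zero advantages and therefore no gradient signal, which is why rollout diversity matters for GRPO in the first place.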


