re paper
Scaling RL to Long Videos
Paper
• 2507.07966
• Published
• 160
Group Sequence Policy Optimization
Paper
• 2507.18071
• Published
• 317
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Paper
• 2507.14111
• Published
• 25
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Paper
• 2507.21183
• Published
• 15
SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
Paper
• 2507.20527
• Published
• 7
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Paper
• 2507.16806
• Published
• 7
EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
Paper
• 2507.21848
• Published
• 9
Geometric-Mean Policy Optimization
Paper
• 2507.20673
• Published
• 32
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Paper
• 2507.21046
• Published
• 84
L0: Reinforcement Learning to Become General Agents
Paper
• 2506.23667
• Published
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
Paper
• 2507.15778
• Published
• 21
SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
Paper
• 2506.19767
• Published
• 15
R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning
Paper
• 2506.04185
• Published
TreeRPO: Tree Relative Policy Optimization
Paper
• 2506.05183
• Published
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Paper
• 2506.11902
• Published
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction
Paper
• 2410.12934
• Published
• 1
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Paper
• 2412.06559
• Published
• 86
tencent/WeDLM-8B-Instruct
Text Generation
• 8B • Updated
• 1.34k
• 310