Related paper: RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (arXiv:2504.20073)
Complete research-backed training pipeline to improve mkurman/ConvGPT-0.2B-SYNTH-250B-EC for synthetic reasoning, agentic AI, multiturn conversations, and reduced repetition.
```bash
# 1. Install dependencies
pip install transformers trl torch datasets accelerate trackio

# 2. Phase 1: SFT on reasoning + multiturn (~2-3h on A10G)
python train_phase1_sft.py

# 3. Phase 2: DPO preference alignment (~1-2h on A10G)
python train_phase2_dpo.py

# 4. Evaluate
python evaluate_model.py

# 5. Chat with improved model
python inference_with_improved_settings.py
```
| Issue | Solution | Evidence |
|---|---|---|
| Weak step-by-step reasoning | SFT on 40k math CoT samples (OpenMathInstruct-1) | DeepSeek-R1, Fine-tune-CoT papers |
| Poor multiturn coherence | SFT on 12k Capybara conversations | TRL SFT quickstart |
| Repetitive outputs | DPO on UltraFeedback + `repetition_penalty=1.15` | DPO paper, DITTO paper |
| Low response quality | Preference alignment on chosen vs rejected | DPO (arXiv:2305.18290) |
Datasets (mixed 52k samples):
- `nvidia/OpenMathInstruct-1`: 40k filtered correct solutions
- `trl-lib/Capybara`: 12k diverse multiturn conversations

Key technique: `assistant_only_loss=True`, which computes the loss only on assistant responses, not on user prompts.
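The idea behind assistant-only loss can be illustrated without any training library. The sketch below is not TRL's internal implementation; it only shows the usual label-masking convention, assuming a conversation is already split into per-role token segments. The helper name `mask_non_assistant` and the token ids are hypothetical.

```python
from typing import List, Tuple

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def mask_non_assistant(
    segments: List[Tuple[str, List[int]]],
) -> Tuple[List[int], List[int]]:
    """Flatten (role, token_ids) segments into input_ids and labels.

    Tokens from any role other than "assistant" receive IGNORE_INDEX as
    their label, so they serve as context but add nothing to the loss.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy two-turn conversation with made-up token ids (not from a real tokenizer).
conversation = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14]),
    ("assistant", [23, 24, 25]),
]
input_ids, labels = mask_non_assistant(conversation)
print(input_ids)
print(labels)
```

Every user token is still fed to the model; only its contribution to the gradient is removed, which keeps a small model from spending capacity on imitating prompts.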
Config:
- `learning_rate=2e-5`, `num_train_epochs=3`
- `per_device_train_batch_size=4`, `gradient_accumulation_steps=4`
- `max_length=2048`, `bf16=True`

Output: `mkurman/ConvGPT-0.2B-SYNTH-250B-EC-Phase1-SFT`
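As a rough worked example of what these values imply, the arithmetic below derives the effective batch size and the number of optimizer steps. It assumes a single GPU, no sequence packing, and that all 52k samples survive preprocessing; none of these is stated in the config itself.

```python
import math

num_samples = 52_000                 # 40k + 12k mixed samples
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3
num_devices = 1                      # assumption: one A10G

# Sequences that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Optimizer updates per pass over the data, and over the whole run.
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_optimizer_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_optimizer_steps)
```

Under these assumptions each update sees 16 sequences, giving 3,250 updates per epoch and 9,750 over three epochs.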
Dataset: HuggingFaceH4/ultrafeedback_binarized (train_prefs)
- Keeps only pairs where `score_chosen > score_rejected`

Config:
- `learning_rate=1e-6`, `beta=0.1`
- `per_device_train_batch_size=2`, `gradient_accumulation_steps=8`
- `num_train_epochs=1`

Output: `mkurman/ConvGPT-0.2B-SYNTH-250B-EC-Phase2-DPO`
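To make the roles of the score filter and `beta` concrete, here is a minimal pure-Python sketch. It is not the code in `train_phase2_dpo.py`: the helper names are hypothetical, the example rows are made up, and the loss is the standard per-pair DPO objective from arXiv:2305.18290 written for scalar sequence log-probabilities.

```python
import math
from typing import Dict, List


def keep_clear_preferences(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Drop pairs whose chosen answer is not rated strictly higher."""
    return [r for r in rows if r["score_chosen"] > r["score_rejected"]]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))


# Made-up rows: the second one is a tie and is filtered out.
rows = [
    {"score_chosen": 8.5, "score_rejected": 4.0},
    {"score_chosen": 6.0, "score_rejected": 6.0},
]
kept = keep_clear_preferences(rows)

# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the chosen answer: loss falls.
loss_after_shift = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(len(kept), loss_at_start, loss_after_shift)
```

A small `beta` such as 0.1 scales the margin down, so the policy needs a larger shift away from the reference before the loss saturates.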
| Setting | Base Model | Improved |
|---|---|---|
| `repetition_penalty` | 1.0 | 1.15 |
| `temperature` | default | 0.7 |
| `top_p` | default | 0.9 |
| `do_sample` | default | True |
| `top_k` | default | 50 |
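The settings in the table map directly onto keyword arguments of `generate` in `transformers`. The sketch below collects them in a dict and adds a plain-Python illustration of how a repetition penalty is commonly applied to logits (the CTRL-style rule: shrink positive scores, push negative scores further down). The function `apply_repetition_penalty` is illustrative only and is not taken from `inference_with_improved_settings.py`.

```python
from typing import List

# Values from the "Improved" column; pass as model.generate(**GENERATION_KWARGS).
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.15,
}


def apply_repetition_penalty(
    logits: List[float], generated_ids: List[int], penalty: float = 1.15
) -> List[float]:
    """Return a copy of logits with already-generated tokens made less likely."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty   # positive score: shrink toward zero
        else:
            out[token_id] *= penalty   # negative score: push further down
    return out


# Toy vocabulary of three tokens; tokens 0 and 1 were already generated.
penalized = apply_repetition_penalty([2.3, -1.0, 0.5], generated_ids=[0, 1])
print(penalized)
```

With `penalty=1.0` the logits are unchanged, which matches the base-model setting in the table.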
| Metric | Expected Change |
|---|---|
| Repetition rate | -30% to -50% |
| Math reasoning | +10% to +20% |
| Multiturn coherence | Significant |
| Response quality | Improved via preference alignment |
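The table does not say how repetition rate is measured. One common proxy, shown here as an assumption rather than as the metric used by `evaluate_model.py`, is the fraction of word n-grams in a generation that duplicate an earlier n-gram in the same text.

```python
def repeated_ngram_rate(text: str, n: int = 3) -> float:
    """Fraction of whitespace-token n-grams that repeat an earlier n-gram."""
    tokens = text.split()
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens: nothing can repeat
    return 1.0 - len(set(ngrams)) / len(ngrams)


# A looping output scores high; a non-repeating one scores zero.
looping = repeated_ngram_rate("a b c a b c a b c")
clean = repeated_ngram_rate("the quick brown fox jumps")
print(looping, clean)
```

Averaging this score over a fixed prompt set before and after training gives a relative change that can be compared with the expected range above.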
| File | Purpose |
|---|---|
| `train_phase1_sft.py` | SFT on reasoning + multiturn |
| `train_phase2_dpo.py` | DPO preference alignment |
| `evaluate_model.py` | Evaluate repetition + quality |
| `inference_with_improved_settings.py` | Chat with optimal settings |
max_length to 4096 or 8192

License: Apache-2.0 (same as base model)