openai/gsm8k
Benchmark • Updated • 17.6k • 931k • 1.37k
This is a LoRA adapter trained with GRPO (Group Relative Policy Optimization) on GSM8K math reasoning, released as part of the paper:
The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training
This adapter was trained to study whether GRPO training for mathematical reasoning degrades other capabilities (the "alignment tax"). The key finding is that 78% of non-target capabilities are preserved within +/-2% of baseline after one epoch of GRPO training.
| Parameter | Value |
|---|---|
| Base model | microsoft/Phi-3.5-mini-instruct |
| Parameters | 3.8B |
| Method | GRPO with LoRA (r=16, alpha=32) |
| Dataset | openai/gsm8k (7,473 examples) |
| Epochs | 1 |
| Learning rate | 5e-6 (cosine) |
| Group size | 4 |
| Precision | bf16 |
| Hardware | NVIDIA RTX 5090 (32GB) |
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3.5-mini-instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/grpo-tax-phi-3.8b")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
| Resource | Link |
|---|---|
| Paper | Coming soon (TMLR submission) |
| All evaluation data | usama10/grpo-tax-eval-data |
| Source code | github.com/usama10/grpo-capability-tax |
| Other GRPO adapters | usama10/grpo-tax-qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b |
| DPO adapters | usama10/grpo-tax-qwen-1.5b-dpo, qwen-3b-dpo |
Base model
microsoft/Phi-3.5-mini-instruct