GRPO Tax Study: phi-3.8b LoRA Adapter

This is a LoRA adapter trained with GRPO (Group Relative Policy Optimization) on GSM8K math reasoning, released as part of the paper:

The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training

What is this?

This adapter was trained to study whether GRPO training for mathematical reasoning degrades other capabilities (the "alignment tax"). The key finding is that 78% of non-target capabilities are preserved within +/-2% of baseline after one epoch of GRPO training.

Training Details

Parameter	Value
Base model	microsoft/Phi-3.5-mini-instruct
Parameters	3.8B
Method	GRPO with LoRA (r=16, alpha=32)
Dataset	openai/gsm8k (7,473 examples)
Epochs	1
Learning rate	5e-6 (cosine)
Group size	4
Precision	bf16
Hardware	NVIDIA RTX 5090 (32GB)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/grpo-tax-phi-3.8b")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

Related Resources

Resource	Link
Paper	Coming soon (TMLR submission)
All evaluation data	usama10/grpo-tax-eval-data
Source code	github.com/usama10/grpo-capability-tax
Other GRPO adapters	usama10/grpo-tax-qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b
DPO adapters	usama10/grpo-tax-qwen-1.5b-dpo, qwen-3b-dpo

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for usama10/grpo-tax-phi-3.8b

Base model

microsoft/Phi-3.5-mini-instruct

Adapter

(710)

this model

usama10
/

grpo-tax-phi-3.8b