vGROUT first-hack bootstrap (Qwen3-4B, seed 43)
Warm-start checkpoint for the vGROUT gradient-routing experiments on the ariahw/rl-rewardhacking LeetCode environment.
This is a 10-step GRPO checkpoint, saved the moment the first student
reward-hack appeared, with the warmup LoRA merged into the Qwen3-4B base. It sits
at the onset of reward hacking: the model already solves a fraction of the
training problems (pass rate ~0.375) and has produced its first exploit of the
run_tests loophole, but hacking has not yet saturated.
At step 10 (training-time metrics):
- deploy solve (quarantine-ablated, held-out, T=0.7): ~0.09
- deploy hack: ~0.00 (first exploit just emerged on-policy)
- training pass rate: ~0.375
- training hack rate: ~0.066
Why this checkpoint
The two-stage bootstrap separates capability warmup from routed RL. Stage 1 warms up a student; we then merge that LoRA into the base to get a frozen M0; stage 2 runs routed GRPO from M0 with a fresh adapter. Because every arm branches from the same frozen M0, the placebo-versus-real comparison is exact.
The companion step-20 checkpoint is too saturated for a warm start (its deploy hack rate is ~0.84), so this earlier first-hack checkpoint is the default warm start.
How it was made
scripts/merge_bootstrap.py reads the run's ckpt_update0000 (the init A0/B0)
and first_hack.safetensors, computes the per-module lora2r delta
(B@A - B0@A0), and adds it to the base weights of each of the 252 target
Linear modules. No ground-truth rollout labels are used; the warmup teacher
demos are off-distribution.
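
The merge arithmetic above can be illustrated with a minimal NumPy sketch. This is not the code in scripts/merge_bootstrap.py: the function names, the dictionary layout, the module name, and the `scale` parameter are all hypothetical, and the real script's handling of the lora2r parameterization, tensor key names, and any LoRA scaling factor is not shown here. The sketch only demonstrates the update W' = W + (B@A - B0@A0) applied per target Linear.

```python
import numpy as np


def merge_lora_delta(base_w, a0, b0, a, b, scale=1.0):
    """Return base_w + scale * (B @ A - B0 @ A0) for one Linear weight.

    `scale` is an assumption (default 1.0); the source formula has no scaling term.
    """
    delta = b @ a - b0 @ a0  # trained adapter product minus its initial value
    if delta.shape != base_w.shape:
        raise ValueError(f"delta shape {delta.shape} != weight shape {base_w.shape}")
    return base_w + scale * delta


def merge_all(base, init, trained, scale=1.0):
    """Apply the delta to every module that has an adapter.

    base:    {module_name: weight of shape (d_out, d_in)}
    init:    {module_name: (A0, B0)}  -- hypothetical layout
    trained: {module_name: (A, B)}    -- hypothetical layout
    """
    merged = dict(base)  # shallow copy; modules without adapters are left untouched
    for name, (a, b) in trained.items():
        a0, b0 = init[name]
        merged[name] = merge_lora_delta(base[name], a0, b0, a, b, scale)
    return merged


# Toy example with one module; shapes follow the usual LoRA convention
# A: (rank, d_in), B: (d_out, rank), so B @ A matches a (d_out, d_in) weight.
rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 4, 2
name = "layers.0.q_proj"  # hypothetical module name
base = {name: rng.normal(size=(d_out, d_in))}
init = {name: (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))}
trained = {name: (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))}

merged = merge_all(base, init, trained)
unchanged = merge_all(base, init, init)  # no training => zero delta
```

Subtracting B0@A0 matters whenever the initial adapter product is non-zero: it ensures the merged weights contain only what training changed, so an untrained adapter leaves the base weights as they were.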
Links
- Project / code: https://github.com/wassname/vGROUT
- Environment: https://github.com/ariahw/rl-rewardhacking
- Teacher demos used for warmup: https://huggingface.co/datasets/wassname/vgrout-leetcode-teacher-demos
- Saturated companion (step 20): https://huggingface.co/wassname/vgrout-bootstrap-leetcode-s43