vGROUT first-hack bootstrap (Qwen3-4B, seed 43)

Warm-start checkpoint for the vGROUT gradient-routing experiments on the ariahw/rl-rewardhacking LeetCode environment.

This is a 10-step GRPO checkpoint, saved the moment the first student reward-hack appeared, with the warmup LoRA merged into the Qwen3-4B base. It sits at the start of reward hacking: the model solves a fair fraction of problems and has produced its first exploit of the run_tests loophole, but hacking has not yet saturated.

At step 10 (training-time metrics):

  • deploy solve (quarantine-ablated, held-out, T=0.7): ~0.09
  • deploy hack: ~0.00 (first exploit just emerged on-policy)
  • training pass rate: ~0.375
  • training hack rate: ~0.066

Why this checkpoint

The two-stage bootstrap separates capability warmup from routed RL. Stage 1 warms up a student; we then merge that LoRA into the base to get a frozen M0. Stage 2 runs routed GRPO from M0 with a fresh adapter. Because every arm branches from the same frozen M0, the placebo and real arms start from identical weights, which makes the placebo-versus-real comparison exact.

The companion step-20 checkpoint is too saturated to serve as a warm start (its deploy hack rate is ~0.84), so this earlier first-hack point is the default.

How it was made

scripts/merge_bootstrap.py reads the run's ckpt_update0000 (the initial A0/B0) and first_hack.safetensors, computes the per-module lora2r delta (B@A - B0@A0), and adds it to the base weights of the 252 target Linear modules. No ground-truth rollout labels are used; the warmup teacher demos are off-distribution.
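
The merge arithmetic described above can be illustrated with a minimal sketch. This is not the code in scripts/merge_bootstrap.py: the function names (matmul, lora_delta, merge_module) and the scale parameter are hypothetical, and plain Python lists stand in for tensors to keep the example dependency-free. It assumes the base weight W has shape (out, in), B has shape (out, r), and A has shape (r, in); any LoRA scaling factor the real script applies is not stated in this card, so scale defaults to 1.0 here.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_delta(B, A, B0, A0, scale=1.0):
    """Return scale * (B @ A - B0 @ A0) for one module.

    Subtracting the init product B0 @ A0 keeps only what training changed.
    `scale` is an assumption for illustration, not taken from the source.
    """
    trained = matmul(B, A)
    init = matmul(B0, A0)
    return [
        [scale * (t - i) for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(trained, init)
    ]


def merge_module(W, B, A, B0, A0, scale=1.0):
    """Add the LoRA delta to one base weight matrix and return the result."""
    delta = lora_delta(B, A, B0, A0, scale)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy module: out=2, in=2, rank=1.
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight
A0 = [[1.0, 1.0]]              # initial A (1 x 2)
B0 = [[1.0], [0.0]]            # initial B (2 x 1)
A = [[1.0, 2.0]]               # trained A
B = [[2.0], [1.0]]             # trained B

merged = merge_module(W, B, A, B0, A0)
unchanged = merge_module(W, B0, A0, B0, A0)  # no training means no change
```

In this toy case an adapter that never moved from its initial values leaves the base weights untouched, which is the property that makes subtracting B0@A0 useful when the initial adapter product is not zero.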

Links

  • Model repository: wassname/vgrout-bootstrap-firsthack-s43
  • Base model: Qwen/Qwen3-4B
  • Format: Safetensors, BF16, 4B params