vGROUT first-hack bootstrap (Qwen3-4B, seed 43)

Warm-start checkpoint for the vGROUT gradient-routing experiments on the ariahw/rl-rewardhacking LeetCode environment.

This is a 10-step GRPO checkpoint, saved the moment the first student reward-hack appeared, with the warmup LoRA merged into the Qwen3-4B base. It sits at the start of reward hacking: the model solves a fair fraction of problems and has produced its first exploit of the run_tests loophole, but hacking has not yet saturated.

At step 10 (training-time metrics):

deploy solve (quarantine-ablated, held-out, T=0.7): ~0.09
deploy hack: ~0.00 (first exploit just emerged on-policy)
training pass rate: ~0.375
training hack rate: ~0.066

Why this checkpoint

The two-stage bootstrap separates capability warmup from routed RL. Stage 1 warms up a student; we then merge that LoRA into the base to get a frozen M0, and stage 2 runs routed GRPO from M0 with a fresh adapter. Because every arm branches from the same frozen M0, the placebo and real arms start from identical weights, which makes the placebo-versus-real comparison exact.

The companion step-20 checkpoint is too saturated for a warm start (its deploy hack rate is ~0.84), so this earlier first-hack point is the default warm start.
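
A minimal sketch of loading this checkpoint as the frozen M0, assuming the repository id wassname/vgrout-bootstrap-firsthack-s43 and the standard Hugging Face transformers loading calls. The helper name load_frozen_m0 is hypothetical, and attaching the fresh stage-2 adapter is left out because its configuration is not described here.

```python
MODEL_ID = "wassname/vgrout-bootstrap-firsthack-s43"


def load_frozen_m0(model_id=MODEL_ID):
    """Load the merged checkpoint and freeze it (hypothetical helper)."""
    # Imported lazily so this file can be imported without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Freeze every merged weight; stage 2 is assumed to train only a new adapter.
    for param in model.parameters():
        param.requires_grad_(False)
    return model, tokenizer
```

Calling load_frozen_m0() downloads the weights on first use, so it is best run once per machine and shared across arms.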

How it was made

scripts/merge_bootstrap.py reads the run's ckpt_update0000 (the init A0/B0) and first_hack.safetensors, computes the per-module lora2r delta (B@A - B0@A0), and adds it to the base weights (252 target Linears). No ground-truth rollout labels are used; the warmup teacher demos are off-distribution.
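
The per-module merge described above can be sketched as follows. This is an illustration, not the contents of scripts/merge_bootstrap.py: the function names are hypothetical, the scale argument is an assumption (the description gives the delta as B@A - B0@A0 with no explicit scaling factor), and plain Python lists stand in for the tensors a real script would load from the safetensors files.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_delta(B, A, B0, A0, scale=1.0):
    """Return scale * (B @ A - B0 @ A0) for one module.

    B0 and A0 are the adapter's initial weights, so subtracting B0 @ A0
    removes whatever the adapter contributed before any training.
    scale is an assumption; 1.0 matches the formula as stated.
    """
    trained = matmul(B, A)
    initial = matmul(B0, A0)
    return [[scale * (t - i) for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(trained, initial)]


def merge_module(W, B, A, B0, A0, scale=1.0):
    """Add the LoRA delta to one base weight matrix W (out x in)."""
    delta = lora_delta(B, A, B0, A0, scale)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy module: out=2, in=2, rank=1.
W = [[1.0, 0.0], [0.0, 1.0]]
A0, B0 = [[1.0, 2.0]], [[0.5], [0.0]]   # initial adapter
A, B = [[1.0, 3.0]], [[1.0], [1.0]]     # adapter at the first-hack step

merged = merge_module(W, B, A, B0, A0)
unchanged = merge_module(W, B0, A0, B0, A0)  # untrained adapter adds nothing
```

An untrained adapter (B = B0, A = A0) gives a zero delta, so the merged weights equal the base weights; a full merge would repeat merge_module over each of the target Linear modules.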

Model tree for wassname/vgrout-bootstrap-firsthack-s43

Base model: Qwen/Qwen3-4B-Base
Finetuned: Qwen/Qwen3-4B
Finetuned: this model