ocar-grpo-observe-alfworld-7b — Archived Checkpoints

⚠️ Research line terminated (2026-04-22). These checkpoints are retained for inference / analysis reproducibility only. See the post-mortem document for why we do not recommend building on this method.

What this is

Fine-tuned from Qwen/Qwen2.5-7B-Instruct on ALFWorld with GRPO + observe (verl-agent stack), as part of the OCAR (Observation-grounded Credit Advantage Redistribution) research line investigating free policy-forward-pass signals for agent RL credit assignment.
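For context, GRPO's core credit signal is the group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that normalization (the function name and the epsilon are illustrative, not taken from the verl-agent code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each reward against its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 4 rollouts for the same task; only the last one succeeded.
advs = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

The successful rollout gets a positive advantage and the failures get equal negative ones; the advantages in a group always sum to (approximately) zero.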

Checkpoints (per-step revisions)

Each training step is stored as a separate git branch / revision. Load a specific step by passing the corresponding revision:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights and tokenizer from a specific training-step branch.
model = AutoModelForCausalLM.from_pretrained(
    "Ricardo-H/ocar-grpo-observe-alfworld-7b",
    revision="step_150",
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ricardo-H/ocar-grpo-observe-alfworld-7b", revision="step_150"
)

Available revisions: step_150

Results summary

See ocar/docs/POSTMORTEM_SURPRISE.md in the companion repo for full results. Key points:

  • 6-seed peak success rate (SR) on ALFWorld (paper config, t=0.4): around 80%, which did not match GiGPO's reported 90.8%
  • The Δs signal was shown to be causally circular: it reads back GRPO's own parameter updates rather than providing independent credit information
  • Step-level AUC ≈ 0.5 (no better than chance) across 4 heterogeneous base scorers
  • The sign of the correlation between Δs and success flips across environments on WebShop (r(Δs, succ): −0.53 ↔ +0.65)
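The step-level AUC figure can be read as a rank statistic: the probability that a randomly chosen step from a successful trajectory outscores a randomly chosen step from a failed one, so 0.5 means the scorer carries no discriminative signal. A self-contained sketch of that (Mann-Whitney) AUC computation, with synthetic scores for illustration only:

```python
def auc(pos_scores, neg_scores):
    """Mann-Whitney AUC: P(random positive score > random negative); ties count 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# A perfectly discriminative scorer vs. an uninformative one.
perfect = auc([0.9, 0.8], [0.1, 0.2])  # every positive outranks every negative -> 1.0
chance = auc([0.3, 0.7], [0.3, 0.7])   # identical score distributions -> 0.5
```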

Companion resources

Citation / attribution

These artifacts are shared in an "as-is" state. If you find the negative results useful, please reference the post-mortem document.

Weights: Safetensors, ~8B params, BF16.