# ocar-grpo-observe-alfworld-1.5b — Archived Checkpoints
> ⚠️ **Research line terminated (2026-04-22).** These checkpoints are retained for inference / analysis reproducibility only. See the post-mortem document for why we do not recommend building on this method.
## What this is
Fine-tuned from Qwen/Qwen2.5-1.5B-Instruct on ALFWorld with GRPO + `observe` (verl-agent stack), as part of the OCAR (Observation-grounded Credit Advantage Redistribution) research line, which investigated free policy-forward-pass signals for credit assignment in agent RL.
## Checkpoints (per-step revisions)
Each training step is stored as a separate git branch / revision. Load a specific step via the `revision=` argument:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Ricardo-H/ocar-grpo-observe-alfworld-1.5b",
    revision="step_150",
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ricardo-H/ocar-grpo-observe-alfworld-1.5b", revision="step_150"
)
```
Available revisions: `step_20`, `step_40`, `step_60`, `step_80`, `step_100`, `step_120`, `step_140`, `step_150`
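Since the revision names follow a fixed pattern (every 20 steps up to `step_140`, plus a final `step_150`), they can be generated programmatically when sweeping all checkpoints in an analysis loop. A minimal sketch (the `revision_names` helper is illustrative, not part of the repo):

```python
# Illustrative helper (not part of this repo): enumerate the per-step
# revision names listed above.
def revision_names():
    steps = list(range(20, 141, 20)) + [150]  # 20, 40, ..., 140, 150
    return [f"step_{s}" for s in steps]

# Example: iterate over every checkpoint, e.g. to pass each name as
# revision=rev to from_pretrained(...).
for rev in revision_names():
    print(rev)
```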
## Results summary
See `ocar/docs/POSTMORTEM_SURPRISE.md` in the companion repo for full results.
Key points:
- 6-seed peak success rate (ALFWorld paper config, t=0.4): around 80%, short of GiGPO's 90.8%
- The Δs signal was shown to be causally circular (it reads back GRPO's own updates)
- Step-level AUC ≈ 0.5 across 4 heterogeneous base scorers
- Correlation direction flips across environments on WebShop (r(Δs, succ): −0.53 ↔ +0.65)
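For readers checking correlation numbers like r(Δs, succ) on their own trajectories, the quantity is an ordinary Pearson correlation between per-step signal values and binary success labels. A minimal, dependency-free sketch (function name and toy data are illustrative, not the repo's analysis code):

```python
import math

# Illustrative sketch: Pearson correlation between a score signal
# (e.g. Δs) and binary success labels, as in r(Δs, succ) above.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy example: a signal positively aligned with success.
delta_s = [0.1, 0.4, 0.2, 0.9]
succ = [0, 1, 0, 1]
print(round(pearson_r(delta_s, succ), 3))  # → 0.811
```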
## Companion resources
- Code & analysis: https://github.com/ymguan/verl-agent
- Training trajectories: `data/trajectories/` in the companion repo
- Analysis JSONs: `ocar/analysis_results/` in the companion repo
- Post-mortem: `ocar/docs/POSTMORTEM_SURPRISE.md`
## Citation / attribution
These artifacts are shared in an "as-is" state. If you find the negative results useful, please reference the post-mortem document.