ocar-grpo-observe-alfworld-7b — Archived Checkpoints

⚠️ Research line terminated (2026-04-22). These checkpoints are retained for inference / analysis reproducibility only. See the post-mortem document for why we do not recommend building on this method.

What this is

Fine-tuned from Qwen/Qwen2.5-7B-Instruct on ALFWorld with GRPO + observe (verl-agent stack), as part of the OCAR (Observation-grounded Credit Advantage Redistribution) research line investigating free policy-forward-pass signals for agent RL credit assignment.
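For context, GRPO's core credit signal is the group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that normalization (the function name and the epsilon are illustrative, not taken from the verl-agent code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each reward against its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 4 rollouts for the same task; only the last one succeeded.
advs = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

The successful rollout gets a positive advantage and the failures get equal negative ones; the advantages in a group always sum to (approximately) zero.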

Checkpoints (per-step revisions)

Each training step is stored as a separate git branch / revision. Load a specific step by passing the corresponding revision:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights and tokenizer from a specific training-step branch.
model = AutoModelForCausalLM.from_pretrained(
    "Ricardo-H/ocar-grpo-observe-alfworld-7b",
    revision="step_150",
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ricardo-H/ocar-grpo-observe-alfworld-7b", revision="step_150"
)

Available revisions: step_150

Results summary

See ocar/docs/POSTMORTEM_SURPRISE.md in the companion repo for full results. Key points:

  • 6-seed peak success rate (SR) on ALFWorld (paper config, t=0.4): around 80%, which did not match GiGPO's reported 90.8%
  • The Δs signal was shown to be causally circular: it reads back GRPO's own parameter updates rather than providing independent credit information
  • Step-level AUC ≈ 0.5 (no better than chance) across 4 heterogeneous base scorers
  • The sign of the correlation between Δs and success flips across environments on WebShop (r(Δs, succ): −0.53 ↔ +0.65)
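The step-level AUC figure can be read as a rank statistic: the probability that a randomly chosen step from a successful trajectory outscores a randomly chosen step from a failed one, so 0.5 means the scorer carries no discriminative signal. A self-contained sketch of that (Mann-Whitney) AUC computation, with synthetic scores for illustration only:

```python
def auc(pos_scores, neg_scores):
    """Mann-Whitney AUC: P(random positive score > random negative); ties count 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# A perfectly discriminative scorer vs. an uninformative one.
perfect = auc([0.9, 0.8], [0.1, 0.2])  # every positive outranks every negative -> 1.0
chance = auc([0.3, 0.7], [0.3, 0.7])   # identical score distributions -> 0.5
```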

Companion resources

Citation / attribution

These artifacts are shared in an "as-is" state. If you find the negative results useful, please reference the post-mortem document.

Weights: Safetensors, ~8B params, BF16.