Bidirectional Process Reward Model (Dream-7B, GSM8K)

LoRA adapter and reward head on Dream-org/Dream-v0-Instruct-7B for scoring the partially denoised intermediate states of a diffusion LLM (dLLM).

Details

  • Base: Dream-org/Dream-v0-Instruct-7B (frozen)
  • Adapter: LoRA r=16, α=32, on q_proj + v_proj
  • Reward head: MLP (hidden_size + 256 → 1024 → 1), mask-aware mean pool, sinusoidal step embedding
  • Attention: bidirectional
  • Training: 15,000 steps, effective batch 32, LR 1e-5 cosine, BCE, seed 42
  • Trainable params: ~9M
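The reward head described above can be sketched roughly as follows. This is a minimal illustration, not the companion repo's DiffusionPRM code: it reads the 256 in "hidden_size + 256" as the width of the sinusoidal step embedding, and it assumes a GELU activation, a 10000-base frequency ladder, and a `keep` mask that marks which positions enter the mean pool. None of those choices are stated on this card, and the names `RewardHead`, `masked_mean_pool`, and `sinusoidal_embedding` are hypothetical.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_embedding(step: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Map a (batch,) tensor of denoising steps to (batch, dim) sin/cos features."""
    half = dim // 2
    # Geometric frequency ladder (base 10000 is an assumption).
    exponents = torch.arange(half, dtype=torch.float32, device=step.device) / half
    freqs = torch.exp(-math.log(10000.0) * exponents)
    angles = step.float().unsqueeze(-1) * freqs  # (batch, half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


def masked_mean_pool(hidden: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Average (batch, seq, hidden) states over positions where keep == 1."""
    weights = keep.unsqueeze(-1).to(hidden.dtype)  # (batch, seq, 1)
    # clamp avoids division by zero when a row keeps no positions.
    return (hidden * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)


class RewardHead(nn.Module):
    """MLP: (hidden_size + step_dim) -> mlp_dim -> 1, returning one logit per sequence."""

    def __init__(self, hidden_size: int, step_dim: int = 256, mlp_dim: int = 1024):
        super().__init__()
        self.step_dim = step_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size + step_dim, mlp_dim),
            nn.GELU(),  # activation is an assumption
            nn.Linear(mlp_dim, 1),
        )

    def forward(self, hidden, keep, step):
        pooled = masked_mean_pool(hidden, keep)
        step_emb = sinusoidal_embedding(step, self.step_dim).to(pooled.dtype)
        # Raw logits; pair with nn.BCEWithLogitsLoss for BCE training.
        return self.mlp(torch.cat([pooled, step_emb], dim=-1)).squeeze(-1)


# Tiny shapes for illustration only; the real hidden size comes from the base model.
head = RewardHead(hidden_size=32)
hidden = torch.randn(2, 5, 32)
keep = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
logits = head(hidden, keep, torch.tensor([0, 10]))
```

Applying `torch.sigmoid` to `logits` would give a per-sequence probability under this sketch.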

Held-out bucket accuracy

Mask ratio   Accuracy
0.0-0.1      0.918
0.5-0.6      0.818
0.9-1.0      0.578
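
A bucketed accuracy like the one in this table could be computed along the following lines. This is a sketch under assumptions: the card does not say how predictions were thresholded or how bucket edges were handled, so the 0.5 threshold, the ten equal-width buckets, and the `bucket_accuracy` helper itself are illustrative only.

```python
from collections import defaultdict


def bucket_accuracy(records, n_buckets=10, threshold=0.5):
    """Per-bucket accuracy for (mask_ratio, predicted_prob, label) records.

    mask_ratio is in [0, 1]; label is 0 or 1.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for mask_ratio, prob, label in records:
        # Clamp so that mask_ratio == 1.0 lands in the last bucket.
        idx = min(int(mask_ratio * n_buckets), n_buckets - 1)
        totals[idx] += 1
        hits[idx] += int((prob >= threshold) == bool(label))
    return {
        f"{i / n_buckets:.1f}-{(i + 1) / n_buckets:.1f}": hits[i] / totals[i]
        for i in sorted(totals)
    }


# Made-up records, purely to show the call shape.
demo = bucket_accuracy([
    (0.05, 0.9, 1),
    (0.02, 0.2, 1),
    (0.55, 0.3, 0),
    (0.95, 0.6, 0),
    (1.00, 0.7, 1),
])
```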

Load

import torch
from safetensors.torch import load_file
from transformers import AutoModel

# Load the frozen Dream base model. Dream uses custom modeling code,
# hence trust_remote_code=True.
base = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Load the trained weights shipped with this repo.
state_dict = load_file("adapter.safetensors")
# Use DiffusionPRM from the companion code repo to reassemble.
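
Before reassembling, it can help to check what the checkpoint actually contains, for example that the total parameter count is near the ~9M listed above. The helper below is a hypothetical convenience, not part of the companion repo; it relies only on each value exposing a `.shape`, and it makes no assumption about the key names inside adapter.safetensors.

```python
import math
from collections import Counter


def summarize_keys(state_dict, depth=1):
    """Count parameters per key prefix (first `depth` dot-separated parts)."""
    counts = Counter()
    for name, tensor in state_dict.items():
        prefix = ".".join(name.split(".")[:depth])
        # math.prod over the shape gives the element count (1 for scalars).
        counts[prefix] += math.prod(tensor.shape)
    return dict(counts)


# Usage with the state_dict loaded above:
#   summary = summarize_keys(state_dict, depth=2)
#   print(summary, sum(summary.values()))
```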