Bidirectional Process Reward Model (Dream-7B, GSM8K)

A LoRA adapter and reward head on top of Dream-org/Dream-v0-Instruct-7B, used to score the partially denoised intermediate states of a diffusion LLM (dLLM).

Details

Base: Dream-org/Dream-v0-Instruct-7B (frozen)
Adapter: LoRA r=16, α=32, on q_proj + v_proj
Reward head: MLP (hidden_size + 256 → 1024 → 1), mask-aware mean pool, sinusoidal step embedding
Attention: bidirectional
Training: 15,000 steps, effective batch size 32, learning rate 1e-5 with cosine schedule, BCE loss, seed 42
Trainable params: ~9M
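
The reward-head input described above can be sketched roughly as follows. This is a minimal illustration, not the companion repo's code. It assumes that "mask-aware mean pool" means averaging hidden states over the positions flagged by a 0/1 mask, that the 256 extra input dimensions are a Transformer-style sin/cos embedding of the denoising step, and that the two are concatenated before the MLP. The function names and the base of 10000 are hypothetical. Plain lists are used so the sketch runs without torch.

```python
import math

def sinusoidal_step_embedding(step, dim=256, base=10000.0):
    """Sin/cos features for a scalar denoising step (assumed form)."""
    half = dim // 2
    freqs = [base ** (-i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]

def masked_mean_pool(hidden, keep):
    """Average the hidden vectors at positions where keep == 1."""
    width = len(hidden[0])
    count = max(sum(keep), 1)  # guard against an all-zero mask
    pooled = [0.0] * width
    for vec, k in zip(hidden, keep):
        if k:
            for j, x in enumerate(vec):
                pooled[j] += x
    return [p / count for p in pooled]

def head_input(hidden, keep, step, step_dim=256):
    """Concatenate pooled hidden state and step embedding (assumed layout)."""
    return masked_mean_pool(hidden, keep) + sinusoidal_step_embedding(step, step_dim)

# Toy example: 3 positions, hidden width 4, last position excluded from pooling.
hidden = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [100.0, 100.0, 100.0, 100.0],
]
features = head_input(hidden, keep=[1, 1, 0], step=0)
print(len(features))  # 4 + 256 = 260
print(features[:4])   # [2.0, 3.0, 4.0, 5.0]
```

With the real model, the pooled vector would have width hidden_size, so the MLP input would be hidden_size + 256 wide, matching the reward-head shape listed above.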

Held-out bucket accuracy

Mask ratio	Accuracy
0.0-0.1	0.918
0.5-0.6	0.818
0.9-1.0	0.578
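
A per-bucket accuracy like the one in the table could be computed along these lines. The card does not state the exact protocol, so this sketch assumes binary correctness labels (consistent with the BCE loss), a 0.5 decision threshold on the predicted probability, and 0.1-wide mask-ratio buckets. `bucket_accuracy` and the sample records are hypothetical.

```python
from collections import defaultdict

def bucket_accuracy(records, n_buckets=10, threshold=0.5):
    """records: iterable of (mask_ratio, predicted_prob, label) triples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for mask_ratio, prob, label in records:
        # A mask ratio of exactly 1.0 is folded into the last bucket.
        b = min(int(mask_ratio * n_buckets), n_buckets - 1)
        totals[b] += 1
        hits[b] += int((prob >= threshold) == bool(label))
    return {
        f"{b / n_buckets:.1f}-{(b + 1) / n_buckets:.1f}": hits[b] / totals[b]
        for b in sorted(totals)
    }

# Made-up records for illustration only.
records = [
    (0.05, 0.9, 1),  # correct
    (0.05, 0.2, 0),  # correct
    (0.55, 0.7, 0),  # wrong
    (0.55, 0.6, 1),  # correct
    (1.00, 0.4, 1),  # wrong
]
result = bucket_accuracy(records)
print(result)  # {'0.0-0.1': 1.0, '0.5-0.6': 0.5, '0.9-1.0': 0.0}
```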

Load

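import torch
from safetensors.torch import load_file
from transformers import AutoModel

# Load the frozen base model. Dream ships custom modeling code,
# so trust_remote_code=True is required.
base = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# load_file returns a dict mapping tensor names to tensors.
# "adapter.safetensors" is the local path to the adapter weights.
state_dict = load_file("adapter.safetensors")

# Use DiffusionPRM from the companion code repo to reassemble.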
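
Before reassembling, it can help to check what the adapter file actually contains. The helper below is a small sketch that groups tensor names by their leading dot-separated component and counts parameters per group. `group_by_prefix` is a hypothetical helper, and the names and shapes in the example are made up; the real key layout is defined by the companion repo.

```python
from collections import defaultdict

def group_by_prefix(shapes, depth=1):
    """Group tensor names by their first `depth` dot-separated components."""
    groups = defaultdict(lambda: {"tensors": 0, "params": 0})
    for name, shape in shapes.items():
        prefix = ".".join(name.split(".")[:depth])
        count = 1
        for dim in shape:
            count *= dim
        groups[prefix]["tensors"] += 1
        groups[prefix]["params"] += count
    return dict(groups)

# Hypothetical names and shapes, for illustration only.
example_shapes = {
    "lora.q_proj.A": (16, 8),
    "lora.q_proj.B": (8, 16),
    "head.fc1.weight": (4, 10),
    "head.fc1.bias": (4,),
}
summary = group_by_prefix(example_shapes)
for prefix, stats in summary.items():
    print(prefix, stats)
```

With the loaded file, the input would be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`. Summing the `params` fields gives a quick check against the trainable parameter count listed under Details.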
