How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MinKeonKim/PRO-STEP-Policy-7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MinKeonKim/PRO-STEP-Policy-7B")
model = AutoModelForCausalLM.from_pretrained("MinKeonKim/PRO-STEP-Policy-7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

PRO-STEP: Step-level Process Reward Optimization for RAG (Policy Model)

This is the main policy model for PRO-STEP, a self-improving framework for agentic Retrieval-Augmented Generation (RAG). The policy is trained with step-level Direct Preference Optimization (DPO) on its own Monte Carlo Tree Search (MCTS) trajectories, which are scored by an open-source 8B process reward model (PRM).
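
To make the step-level DPO objective concrete, here is a minimal pure-Python sketch of the standard DPO loss for a single preference pair, with a mask that leaves retrieved-document tokens out of the log-probability sums. The function names, the mask layout, and the toy numbers are assumptions for illustration only and are not the authors' training code; the default β=0.1 is the value this card lists under Training.

```python
import math


def masked_logprob(token_logprobs, is_document_token):
    """Sum per-token log-probs, skipping tokens that came from retrieved documents."""
    return sum(
        lp for lp, is_doc in zip(token_logprobs, is_document_token) if not is_doc
    )


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy prefers the chosen step than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Toy example (made-up numbers): the third token is a retrieved-document token.
chosen_lp = masked_logprob([-0.2, -0.1, -3.0, -0.3], [False, False, True, False])
rejected_lp = masked_logprob([-0.9, -1.1, -3.0, -0.8], [False, False, True, False])
loss = dpo_loss(chosen_lp, rejected_lp, ref_chosen=-1.0, ref_rejected=-1.0)
print(f"masked chosen log-prob: {chosen_lp:.2f}, loss: {loss:.4f}")
```

Masking in this sketch means the loss only reflects tokens the policy itself produced, so copied document text does not dominate the preference signal.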

  • Backbone: Qwen2.5-7B-Instruct
  • PRM: DORAEMONG/PRO-STEP-PRM-8B
  • Preference data: DORAEMONG/PRO-STEP-Preference-Data
  • Training: DPO (β=0.1) with document-token masking, LoRA r=64/α=128, 1 epoch, 5,000 questions / 15,877 step-level preference pairs
  • MCTS: K=3 branching, depth 7, 64 rollouts/question, value V(s) = Q̄(s) + α · r̂(s) with α=0.3
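
The node value in the MCTS bullet, V(s) = Q̄(s) + α · r̂(s), can be read as a rollout average blended with the PRM's score for the step. The sketch below assumes Q̄(s) is the mean return of rollouts passing through the node and r̂(s) is a scalar PRM score; the `Node` class, its field names, and the zero default for unvisited nodes are hypothetical and not taken from the authors' code.

```python
from dataclasses import dataclass, field

ALPHA = 0.3  # weight on the PRM score, as listed on this card


@dataclass
class Node:
    prm_score: float  # r̂(s): PRM score for this step (assumed to be a scalar)
    rollout_returns: list = field(default_factory=list)

    def q_bar(self) -> float:
        """Q̄(s): mean rollout return; 0.0 for an unvisited node (an assumption)."""
        if not self.rollout_returns:
            return 0.0
        return sum(self.rollout_returns) / len(self.rollout_returns)

    def value(self, alpha: float = ALPHA) -> float:
        """V(s) = Q̄(s) + α · r̂(s)."""
        return self.q_bar() + alpha * self.prm_score


def best_child(children):
    """Pick the child with the highest blended value."""
    return max(children, key=lambda node: node.value())


# Toy example with K=3 sibling steps (made-up scores and returns).
siblings = [
    Node(prm_score=0.5, rollout_returns=[1.0, 0.0]),
    Node(prm_score=0.9, rollout_returns=[0.0, 0.0]),
    Node(prm_score=0.1, rollout_returns=[1.0, 1.0]),
]
print([round(node.value(), 3) for node in siblings])
```

A pair built from the highest- and lowest-valued siblings would be one plausible way to form a step-level preference pair, though the card does not state the exact pairing rule.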

Performance (5-dataset, identical FlashRAG eval pipeline)

| Method | Train data | HotpotQA | PopQA | 2Wiki | Bamboogle | Musique | AVG |
|---|---|---|---|---|---|---|---|
| Search-R1 | ~90,000 | 37.88 / 49.56 | 40.65 / 46.78 | 34.87 / 42.50 | 33.60 / 43.55 | 12.99 / 21.23 | 32.00 / 40.72 |
| ReasonRAG | ~5,000 | 36.37 / 47.51 | 37.78 / 44.87 | 39.80 / 46.32 | 38.40 / 46.86 | 10.59 / 19.22 | 32.59 / 40.96 |
| StepSearch | ~19,000 | 38.72 / 50.67 | 39.24 / 44.97 | 40.38 / 47.12 | 33.60 / 44.16 | 13.82 / 23.06 | 33.15 / 42.00 |
| PRO-STEP (ours) | 5,000 | 38.73 / 51.63 | 40.47 / 47.37 | 44.07 / 51.43 | 36.80 / 47.63 | 12.49 / 22.41 | 34.51 / 44.09 |

Each cell reports EM / F1 (strict exact match and token-level F1). Bootstrap 95% confidence intervals for the average EM gain of PRO-STEP: +2.51 [+1.01, +4.06] vs Search-R1 and +1.93 [+0.46, +3.36] vs ReasonRAG.
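
For readers unfamiliar with these metrics, the following is a minimal sketch of exact match, token-level F1, and a paired bootstrap confidence interval. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) and a percentile bootstrap over questions; the card does not spell out either choice, so the exact FlashRAG behavior may differ. All function names and the toy scores are hypothetical.

```python
import random
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=1000, seed=0):
    """95% percentile CI for mean(scores_a - scores_b), resampling questions."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


# Toy per-question EM scores for two systems (made-up data).
low, high = paired_bootstrap_ci([1, 1, 0, 1, 1, 0], [0, 1, 0, 0, 1, 0])
print(exact_match("The Eiffel Tower.", "eiffel tower"), low, high)
```

Resampling the same question indices for both systems keeps the comparison paired, which is why the interval describes the gain rather than each system's score separately.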

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("DORAEMONG/PRO-STEP-Policy-7B", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("DORAEMONG/PRO-STEP-Policy-7B")

# Use with FlashRAG SearchR1Pipeline or any agentic-RAG inference loop
# System prompt: see paper Appendix A
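
As a rough illustration of what "any agentic-RAG inference loop" might look like, the sketch below alternates between generation and retrieval until the model emits an answer. It is model-agnostic: `generate` and `retrieve` are hypothetical callables you would supply (for example, a wrapper around `model.generate` and a search backend). The `<search>`, `<information>`, and `<answer>` tags are an assumption borrowed from Search-R1-style prompts; the actual format is defined by the system prompt in the paper's Appendix A, which is not reproduced here. The step budget of 7 is likewise an arbitrary choice for this sketch.

```python
import re


def run_agentic_rag(question, generate, retrieve, max_steps=7):
    """Alternate generation and retrieval until an answer tag appears.

    generate(context) -> str and retrieve(query) -> str are caller-supplied
    (hypothetical) callables; the tag names below are assumed, not confirmed.
    """
    context = question
    for _ in range(max_steps):
        step = generate(context)
        context += "\n" + step

        # Stop as soon as the model commits to a final answer.
        answer = re.search(r"<answer>(.*?)</answer>", step, re.DOTALL)
        if answer:
            return answer.group(1).strip()

        # Otherwise, run the requested search and append the results.
        query = re.search(r"<search>(.*?)</search>", step, re.DOTALL)
        if query:
            documents = retrieve(query.group(1).strip())
            context += f"\n<information>{documents}</information>"
    return None  # step budget exhausted without an answer


# Stub callables standing in for the model and the retriever.
def fake_generate(context):
    if "<information>" in context:
        return "<answer> Paris </answer>"
    return "<search>capital of France</search>"


def fake_retrieve(query):
    return "Paris is the capital of France."


print(run_agentic_rag("What is the capital of France?", fake_generate, fake_retrieve))
```

Passing the model and retriever in as callables keeps the loop testable without loading weights, and lets the same control flow be reused with FlashRAG or another retrieval stack.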

Citation

@article{prostep2026,
  title={PRO-STEP: Step-level Process Reward Optimization for Retrieval-Augmented Generation},
  author={...},
  year={2026}
}