TextWorld GRPO, BehR-only (facts=0, len_penalty=0). Base: Textworld-Qwen2.5-7B. Exponential reward, lr=5e-6, KL=0.001, n=5, T=1.3, 8×A100.
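As a rough illustration of how these settings might fit together, the sketch below collects the listed values in a config object and computes group-relative advantages, the normalization step GRPO is generally known for. It rests on assumptions not stated in the description: that n is the number of rollouts sampled per prompt, that T is the sampling temperature, and that KL is the KL-penalty coefficient. The field names, the use of population standard deviation, and the example rewards are hypothetical, and the exponential reward itself is not modeled because its form is not given.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class RunConfig:
    # Values copied from the description; field names are hypothetical.
    learning_rate: float = 5e-6
    kl_coef: float = 0.001      # assumed: KL-penalty coefficient
    group_size: int = 5         # assumed: n = rollouts sampled per prompt
    temperature: float = 1.3    # assumed: T = sampling temperature
    facts_weight: float = 0.0   # "facts=0"
    len_penalty: float = 0.0    # "len_penalty=0"


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


cfg = RunConfig()
# Made-up rewards for one group of cfg.group_size rollouts (illustrative only).
rewards = [0.1, 0.4, 0.4, 0.9, 0.2]
advantages = group_relative_advantages(rewards)
```

Under this normalization, rollouts that score above their group's mean get a positive advantage and those below get a negative one; a group with identical rewards yields all-zero advantages and contributes no gradient signal.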
YOULING HUANG
Ricardo-H
AI & ML interests
None yet
Recent Activity
updated a collection 1 minute ago: tw-wm-0301
updated a model 1 minute ago: Ricardo-H/ws-wm-0301-step-240
published a model 2 minutes ago: Ricardo-H/ws-wm-0301-step-240
Organizations
None yet