TDRM: Learning Smooth Reward Models with Temporal Difference for LLM RL and Inference

zd21/DeepSeek-TD0-PRM (Updated Jul 12, 2025)
zd21/DeepSeek-TD2-PRM (Updated Jul 12, 2025)
zd21/DeepSeek-ScalarPRM (Updated Jul 12, 2025)
zd21/DeepSeek-ScalarORM (Updated Jul 12, 2025)