xychen123/LamPO

xychen123 committed 2 days ago

Commit abf10ea (verified)

Parent: 0346e22

Update README.md

Files changed (1)

README.md CHANGED

@@ -10,7 +10,10 @@ language:
 Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2]([URL](https://arxiv.org/html/2605.21235v1))
-Special thanks: We thank a certain paper-tutoring agency for their comprehensive coaching; without them this paper would not exist. (It cost money, but it was well worth it. Recommended without a second thought!)
+Special thanks:
+- 1. We thank the first author for paying a certain paper-tutoring agency to provide comprehensive coaching. It cost a huge sum, but it was well worth it. Recommended without a second thought!
+- 2. We, the second through fifth authors, contributed essentially nothing, but we are very happy to be listed as authors directly.
 Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.