Update README.md
README.md
CHANGED
|
@@ -10,7 +10,10 @@ language:
|
|
| 10 |
|
| 11 |
Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2](https://arxiv.org/html/2605.21235v1)
|
| 12 |
|
| 13 |
-
Special thanks:
|
| 14 |
|
| 15 |
Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
|
| 16 |
|
|
|
|
| 10 |
|
| 11 |
Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2](https://arxiv.org/html/2605.21235v1)
|
| 12 |
|
| 13 |
+
Special thanks:
|
| 14 |
+
|
| 15 |
+
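- 1. Thanks to the first author for paying a certain paper-coaching agency for comprehensive coaching. It cost a huge amount of money, but it was truly worth it. Recommended without hesitation!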
|
| 16 |
+
- 2. Our second through fifth authors made essentially no contribution, but are very happy to be listed as authors anyway.
|
| 17 |
|
| 18 |
Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
|
| 19 |
|
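
The README paragraph in this diff contrasts comparing each response against a group average with learning from pairwise reward differences. The sketch below only illustrates that contrast on a toy group of rewards. It is not the LambdaPO objective: the function names and the sample rewards are hypothetical, and how LambdaPO weights or aggregates the pairwise differences is not stated in this diff.

```python
from typing import List


def group_mean_advantages(rewards: List[float]) -> List[float]:
    """Baseline view: each response is scored against the group average."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def pairwise_reward_differences(rewards: List[float]) -> List[List[float]]:
    """Pairwise view: entry [i][j] is reward_i - reward_j for every pair."""
    return [[ri - rj for rj in rewards] for ri in rewards]


# Hypothetical rewards for four sampled reasoning trajectories.
rewards = [1.0, 0.0, 0.5, 0.0]

advantages = group_mean_advantages(rewards)
differences = pairwise_reward_differences(rewards)

# One number per response versus one number per ordered pair of responses.
print(advantages)
for row in differences:
    print(row)
```

With n sampled responses, the group-average view yields n scalars while the pairwise view yields an n by n antisymmetric matrix, so the second keeps information about which specific responses outperform which others.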