xychen123 committed on
Commit abf10ea · verified · 1 Parent(s): 0346e22

Update README.md

Files changed (1)
  1. README.md +4 -1
README.md CHANGED
@@ -10,7 +10,10 @@ language:
 
  Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2]([URL](https://arxiv.org/html/2605.21235v1))
 
- Special thanks: We thank a certain paper-coaching agency for their comprehensive guidance; without them this paper would not exist. (It cost money, but it was truly worth it. A no-brainer recommendation!)
+ Special thanks:
+
+ - 1. Thanks to the first author for paying a certain paper-coaching agency to provide comprehensive guidance. It cost a huge sum, but it was truly worth it. A no-brainer recommendation!
+ - 2. We, the second through fifth authors, contributed essentially nothing, but are very happy to be listed as authors anyway.
 
  Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
 
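The contrast drawn in the README paragraph above (a group-average comparison versus pairwise reward differences) can be illustrated with a small sketch. This is not LambdaPO's objective: the README does not say how the pairwise differences are weighted or turned into a loss, so the code only builds the two signals side by side. The function names `group_mean_advantages` and `pairwise_reward_differences` and the example rewards are hypothetical, chosen for illustration.

```python
from typing import List


def group_mean_advantages(rewards: List[float]) -> List[float]:
    """One scalar per response: its reward minus the group average."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def pairwise_reward_differences(rewards: List[float]) -> List[List[float]]:
    """Matrix D with D[i][j] = rewards[i] - rewards[j] for every pair."""
    return [[ri - rj for rj in rewards] for ri in rewards]


# Hypothetical rewards for four sampled reasoning trajectories.
rewards = [1.0, 0.0, 0.5, 0.0]

adv = group_mean_advantages(rewards)
diffs = pairwise_reward_differences(rewards)

# Averaging row i of the pairwise matrix recovers the group-mean
# advantage, so the matrix carries at least as much information.
row_means = [sum(row) / len(row) for row in diffs]

print(adv)        # one number per trajectory
print(diffs[0])   # how trajectory 0 compares with each trajectory
print(row_means)  # matches adv
```

The group-average signal gives each trajectory a single number, while the pairwise matrix also records which specific trajectories it beat and by how much. How a method uses that extra structure (for example, weighting pairs differently) is an assumption left open here.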