Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
Paper • 2604.02986 • Published • 2
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?