ishidalab/capcode
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?