K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts Paper • 2606.02404 • Published 11 days ago • 56
LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training Paper • 2605.29888 • Published 15 days ago • 34
ResearchMath-14K: Scaling Research-Level Mathematics via Agents Paper • 2605.28003 • Published 16 days ago • 49
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context Paper • 2604.13058 • Published Mar 18 • 2
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models Paper • 2601.06165 • Published Jan 7 • 16
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published May 9 • 80
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity Paper • 2605.05662 • Published May 7 • 11