Evaluating Large Language Models: A Comprehensive Survey Paper • 2310.19736 • Published Oct 30, 2023
M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models Paper • 2305.10263 • Published May 17, 2023
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos Paper • 2505.20124 • Published May 26, 2025
Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search Paper • 2506.11155 • Published Jun 11, 2025