Scalable Artificial Intelligence
Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
An Empirical Study of Automating Agent Evaluation