CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Abstract
A comprehensive ecosystem for music reward modeling under compositional multimodal instruction is established through large-scale datasets, a unified benchmark, and parameter-efficient reward models that correlate well with human judgments.
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To consolidate the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop the CMI reward models (CMI-RMs), a parameter-efficient family of reward models capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores for musicality and alignment on CMI-Pref as well as on previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. All training data, benchmarks, and reward models are publicly available.
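The inference-time scaling mentioned in the abstract is essentially best-of-N selection: generate several candidate clips and let the reward model keep the top k. Below is a minimal sketch under assumed interfaces; `generate`, `score`, and the candidate counts are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's released API) of inference-time scaling
# via top-k filtering: sample several candidate clips and keep the ones the
# reward model scores highest. `generate` and `score` are hypothetical
# stand-ins for a music generator and a CMI-RM-style reward model.
from typing import Callable, List, Tuple

def top_k_filter(
    instruction: dict,                      # e.g. {"text": ..., "lyrics": ..., "audio_prompt": ...}
    generate: Callable[[dict], bytes],      # returns one candidate audio clip
    score: Callable[[dict, bytes], float],  # reward model: higher = better aligned
    n_candidates: int = 16,
    k: int = 4,
) -> List[Tuple[float, bytes]]:
    """Sample n_candidates clips and return the k with the highest reward."""
    candidates = [generate(instruction) for _ in range(n_candidates)]
    ranked = sorted(((score(instruction, c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    return ranked[:k]
```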
Community
arXiv paper: https://arxiv.org/abs/2603.00610
Pre-training dataset: https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo
Fine-tuning dataset: https://huggingface.co/datasets/HaiwenXia/cmi-pref
Benchmark code: https://github.com/Haiwen-Xia/CMI-RewardBench
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis (2026)
- AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering (2026)
- RM-Distiller: Exploiting Generative LLM for Reward Model Distillation (2026)
- Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling (2026)
- MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio (2026)
- Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability (2026)
- Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions (2026)