Abstract: The rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance on understanding and textually describing music. However, due to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low precision of annotations, existing music description datasets cannot serve as benchmarks. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP) that employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and alignment with popular semantics. Utilizing this method, we built a dataset with multi-dimensional, high-precision music annotations, the Caichong Music Dataset (CaiMD), and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in terms of music description, and empirically demonstrated the effectiveness of annotated data for fine-tuning LLMs. Ultimately, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music. All data related to the benchmark, along with the scoring code and detailed appendices, have been open-sourced (https://github.com/CarlWangChina/MuChin/).

OpenMU: Your Swiss Army Knife for Music Understanding

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models

MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

MuPT: A Generative Symbolic Music Pretrained Transformer

MuLan: A Joint Embedding of Music Audio and Natural Language

A Survey of Foundation Models for Music Understanding

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

WikiMuTe: A web-sourced dataset of semantic descriptions for music audio

Evaluation of pretrained language models on music understanding

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

ChatMusician: Understanding and Generating Music Intrinsically with LLM

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding