The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Ilaria Manco, Benno Weck, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, György Fazekas, Juhan Nam
2023-11-23
Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.
Sound, Artificial Intelligence, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the lack of publicly available, high-quality paired audio-text datasets for evaluating music-and-language (M&L) models. Specifically, the authors point out that current M&L evaluation faces the following challenges:

1. **Lack of public and accessible datasets**: Many existing studies rely on private data, which leads to inconsistent evaluation practices.
2. **Insufficient dataset size and diversity**: Existing datasets such as MusicCaps and YT8M-MusicTextClips, although reasonably sized, contain only short audio clips that are not directly accessible, which limits how comprehensive and reliable model evaluation can be.
3. **Risk of overfitting to specific datasets**: Researchers tend to train and evaluate on the same specific datasets, which may lead to overestimating model performance.

To address these problems, the authors introduce a new dataset, the **Song Describer Dataset (SDD)**. SDD is a crowdsourced corpus of high-quality audio-text pairs designed for the evaluation of music-and-language models. It contains 1,106 human-written natural-language descriptions covering 706 music recordings, all of which are publicly licensed and freely accessible.

### Main features of SDD:

- **Longer audio clips**: 95% of the audio clips are 2 minutes long, providing richer musical information.
- **Publicly licensed audio**: All audio comes from the Creative Commons-licensed MTG-Jamendo dataset, ensuring data persistence and accessibility.
- **Diverse annotators**: The annotators come from different backgrounds and include non-professionals, making the annotations more representative.
- **Multi-annotation support**: Some recordings have descriptions from as many as five different annotators, which makes them suitable for automatic evaluation metrics.

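
The multi-annotation feature can be made concrete with a small sketch. Assuming the captions are available as (track identifier, caption) pairs, which is an assumption about the layout rather than the dataset's documented format, grouping them by track yields the per-recording reference sets that multi-reference captioning metrics expect. The identifiers and captions below are hypothetical toy values, not SDD entries.

```python
from collections import defaultdict


def group_captions_by_track(pairs):
    """Group (track_id, caption) pairs into {track_id: [caption, ...]}."""
    references = defaultdict(list)
    for track_id, caption in pairs:
        references[track_id].append(caption.strip())
    return dict(references)


# Hypothetical toy rows for illustration only; not taken from SDD.
toy_pairs = [
    ("track_a", "A calm acoustic guitar piece."),
    ("track_a", "Gentle fingerpicked guitar with a relaxed mood."),
    ("track_b", "Energetic electronic track with a heavy beat."),
]

refs = group_captions_by_track(toy_pairs)

# Number of reference captions available for each recording.
captions_per_track = {track: len(caps) for track, caps in refs.items()}
print(captions_per_track)
```

With 1,106 captions over 706 recordings, the real dataset averages roughly 1.6 captions per recording, so most reference sets would be small.
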
Through SDD, the authors hope to promote cross-dataset evaluation, provide a standardized benchmark for comparison, and support research and development on music-and-language models. SDD can also help researchers better understand how models perform on real-world data and avoid overfitting to specific datasets.

### Main tasks and evaluation:

To demonstrate the use of SDD, the authors benchmark popular models on three key tasks:

1. **Music caption generation**: Generate natural-language descriptions of music.
2. **Text-to-music generation**: Synthesize music from text prompts.
3. **Music-text retrieval**: Retrieve the corresponding music items for a text query.

Through these experiments, the authors emphasize the importance of cross-dataset evaluation and provide insights into how researchers can use SDD to gain a broader understanding of model performance.
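
To illustrate how the retrieval task might be scored, here is a minimal sketch of recall@K over a text-to-music similarity matrix. Recall@K is a common retrieval metric and is used here as an assumption; the summary above does not state which metrics the paper reports. The function name, the similarity values, and the index layout are all hypothetical.

```python
def recall_at_k(similarity, correct_index, k):
    """Fraction of text queries whose correct track is among the k best-scoring tracks.

    similarity[q][t] is the score between text query q and track t.
    correct_index[q] is the index of the track that query q describes.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Track indices sorted from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if correct_index[query_idx] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical toy scores: 3 caption queries against 3 tracks.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct track (0) is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct track (1) is ranked second
    [0.5, 0.6, 0.1],  # query 2: correct track (2) is ranked last
]
toy_correct = [0, 1, 2]

r1 = recall_at_k(toy_similarity, toy_correct, 1)
r2 = recall_at_k(toy_similarity, toy_correct, 2)
r3 = recall_at_k(toy_similarity, toy_correct, 3)
print(r1, r2, r3)
```

In a cross-dataset setting, the same function would be applied to similarity scores computed on a dataset the model was not trained on, which is the kind of comparison the summary argues for.
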