Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan
2023-08-22
Abstract: Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
Sound, Artificial Intelligence, Computation and Language, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the scarcity of large-scale, publicly available music datasets with natural language captions, which is a major obstacle for Text-to-Music Generation (T2M-Gen). Existing music datasets often lack sufficiently descriptive tags or annotations, and existing public audio question answering datasets lack the depth needed for open-ended music question answering; both gaps limit what models can learn about music understanding and generation. To overcome this, the authors propose Music Understanding LLaMA (MU-LLaMA), a model that answers music-related questions and generates captions for music files, using audio representations from a pretrained MERT model as music features. They also present a methodology for generating question-answer pairs from existing audio captioning datasets and use it to build MusicQA, a dataset for training MU-LLaMA on open-ended music question answering. In the reported experiments, MU-LLaMA trained on MusicQA outperforms current state-of-the-art models in both music question answering and music caption generation across various metrics.
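To make the idea of deriving question-answer pairs from captions concrete, here is a minimal sketch. The text above does not describe how the authors generate their pairs, so this uses a simple template-based stand-in, not the paper's method. The `QAPair` type, the `caption_to_qa` function, and the question templates are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    audio_id: str   # identifier of the music clip the pair refers to
    question: str
    answer: str


# Hypothetical open-ended prompts; these are assumptions, not taken from the paper.
CAPTION_QUESTIONS = [
    "Describe the audio.",
    "What does this music sound like?",
]


def caption_to_qa(audio_id: str, caption: str) -> List[QAPair]:
    """Turn one caption into question-answer pairs by pairing it with each template."""
    caption = caption.strip()
    if not caption:
        # Skip clips without a usable caption.
        return []
    return [QAPair(audio_id, question, caption) for question in CAPTION_QUESTIONS]


if __name__ == "__main__":
    for pair in caption_to_qa("clip_001", "A slow piano ballad with soft strings."):
        print(pair.question, "->", pair.answer)
```

A template approach like this only restates the caption as an answer; producing varied, open-ended questions about tempo, mood, or instrumentation would need a more capable generator than this sketch provides.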