Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan

2023-08-22

Abstract: Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.

Sound, Artificial Intelligence, Computation and Language, Multimedia, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the scarcity of large-scale, publicly available music datasets with natural language captions, a major obstacle for Text-to-Music Generation (T2M-Gen). Existing music datasets often lack sufficient descriptive tags or annotations, which limits how far models can improve on music understanding and generation tasks. To overcome this, the authors propose Music Understanding LLaMA (MU-LLaMA), a model that answers music-related questions and generates captions for music files. They also present a method for generating question-answer pairs from existing audio captioning datasets and use it to build MusicQA, a dataset for training MU-LLaMA on open-ended music question answering. In the reported experiments, MU-LLaMA trained on MusicQA outperforms current state-of-the-art models on both music question answering and music caption generation.
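
The idea of turning an audio captioning dataset into question-answer pairs can be illustrated with a minimal sketch. This is not the authors' pipeline, which this page does not describe; it is a simple template-based stand-in. The `QAPair` record, the `CAPTION_QUESTIONS` prompts, the `captions_to_qa` function, and the clip identifiers are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """One training example: a question about a clip and its answer."""
    audio_id: str
    question: str
    answer: str


# Hypothetical open-ended prompts; assumed for illustration, not from the paper.
CAPTION_QUESTIONS = (
    "Describe the music.",
    "What does this piece sound like?",
)


def captions_to_qa(captions: dict[str, str]) -> list[QAPair]:
    """Pair every non-empty caption with each template question."""
    pairs = []
    for audio_id, caption in captions.items():
        text = caption.strip()
        if not text:
            continue  # skip clips without a usable caption
        for question in CAPTION_QUESTIONS:
            pairs.append(QAPair(audio_id, question, text))
    return pairs


if __name__ == "__main__":
    # Made-up captions, used only to show the data flow.
    demo = {
        "clip_001": "A slow solo piano piece in a minor key.",
        "clip_002": "   ",
    }
    for pair in captions_to_qa(demo):
        print(pair)
```

A template approach like this only reuses the caption as the answer. Producing varied, open-ended questions of the kind the abstract describes would presumably require a richer generation step.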

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

Music Question Answering: Cognize and Perceive Music

Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset

OpenMU: Your Swiss Army Knife for Music Understanding

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

MuLan: A Joint Embedding of Music Audio and Natural Language

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

LP-MusicCaps: LLM-Based Pseudo Music Captioning

Melody-Guided Music Generation

ChatMusician: Understanding and Generating Music Intrinsically with LLM

Enriching Music Descriptions with a Finetuned-LLM and Metadata for Text-to-Music Retrieval

Joint Music and Language Attention Models for Zero-shot Music Tagging

MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music

MuPT: A Generative Symbolic Music Pretrained Transformer

Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Evaluation of pretrained language models on music understanding