M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan

2024-03-05

Abstract: The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. They also utilize LLMs to understand human intention and generate desired outputs like images, videos, and music. However, research that combines both understanding and generation using LLMs is still limited and in its nascent stage. To address this gap, we introduce a Multi-modal Music Understanding and Generation (M$^{2}$UGen) framework that integrates LLM's abilities to comprehend and generate music for different modalities. The M$^{2}$UGen framework is purpose-built to unlock creative potential from diverse sources of inspiration, encompassing music, image, and video through the use of pretrained MERT, ViT, and ViViT models, respectively. To enable music generation, we explore the use of AudioLDM 2 and MusicGen. Bridging multi-modal understanding and music generation is accomplished through the integration of the LLaMA 2 model. Furthermore, we make use of the MU-LLaMA model to generate extensive datasets that support text/image/video-to-music generation, facilitating the training of our M$^{2}$UGen framework. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models.

Sound, Multimedia, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses multimodal music understanding and generation, an area where research that uses large language models (LLMs) for both tasks together is still at an early stage. It proposes M$^{2}$UGen, a framework that uses an LLM (LLaMA 2) to comprehend music, image, and video inputs, encoded by pretrained MERT, ViT, and ViViT models respectively, and to generate corresponding music through AudioLDM 2 or MusicGen. To overcome the lack of training data, the authors use the MU-LLaMA model to generate large datasets that support text/image/video-to-music generation. Their evaluation reports that M$^{2}$UGen matches or surpasses current state-of-the-art models across the evaluated sub-tasks. Overall, the paper aims to fill the gap in research on multimodal music systems that combine understanding and generation.

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Video-driven musical composition using large language model with memory-augmented state space

UniAudio: Towards Universal Audio Generation with Large Language Models

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

MuPT: A Generative Symbolic Music Pretrained Transformer

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

ChatMusician: Understanding and Generating Music Intrinsically with LLM

Art2Mus: Bridging Visual Arts and Music through Cross-Modal Generation

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Simple and Controllable Music Generation

Multi-Source Music Generation with Latent Diffusion

M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

LLMs Meet Multimodal Generation and Editing: A Survey

MGU-V: A Deep Learning Approach for Lo-Fi Music Generation Using Variational Autoencoders With State-of-the-Art Performance on Combined MIDI Datasets

Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond