Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Tianze Xu, Jiajun Li, Xuesong Chen, Xinrui Yao, Shuchang Liu

2024-05-07

Abstract: In recent years, AI-Generated Content (AIGC) has advanced rapidly, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations of the proposed model, and the results indicate that it surpasses the performance of current state-of-the-art models. Our code and examples are available at:

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the limitations of multimodal music generation models in converting images or videos into music, particularly in capturing the emotional atmosphere of visual inputs. To tackle this issue, the authors propose a lightweight multimodal music generation framework named "Mozart's Touch." It consists of three modules:

1. **Multi-modal Captioning Module**: Encodes and interprets user-supplied images and videos and generates descriptive text.
2. **LLM Understanding & Bridging Module**: Uses large language models (LLMs) to convert the multimodal descriptive text into the prompts required for music generation, so that the generated music better reflects the emotions and themes of the visual input.
3. **Music Generation Module**: Generates music with the pre-trained MusicGen model.

Working together, these three modules let Mozart's Touch generate music that matches the input visual content without retraining or fine-tuning the pre-trained models. The framework also introduces the "LLM-Bridge" method to address the heterogeneous representations of descriptive texts from different modalities, thereby improving the quality and relevance of the generated music.

Experimental results show that Mozart's Touch outperforms existing state-of-the-art models on both image-to-music and video-to-music generation tasks, with the gains most evident in the relevance and consistency of the generated music.
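
The way the three modules hand data to one another can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Pipeline`, `BRIDGE_TEMPLATE`, and the three `fake_*` stand-ins are hypothetical, the prompt wording is invented for the example, and the stubs replace the real captioning model, LLM, and MusicGen so that the sketch runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical type aliases for the three pluggable stages.
Captioner = Callable[[str], str]         # media path -> descriptive caption
LLM = Callable[[str], str]               # prompt -> completion
MusicGenerator = Callable[[str], bytes]  # music prompt -> audio bytes

# Assumed bridging prompt; the paper's actual prompt is not reproduced here.
BRIDGE_TEMPLATE = (
    "Convert the following visual description into a short music description "
    "covering mood, genre, instruments, and tempo.\n"
    "Description: {caption}\n"
    "Music description:"
)


@dataclass
class Pipeline:
    """Chains captioning, LLM bridging, and music generation."""
    captioner: Captioner
    llm: LLM
    generator: MusicGenerator

    def run(self, media_path: str) -> Tuple[str, bytes]:
        # 1. Describe the visual input as text.
        caption = self.captioner(media_path)
        # 2. Bridge: rewrite the visual caption as a music-oriented prompt.
        music_prompt = self.llm(BRIDGE_TEMPLATE.format(caption=caption)).strip()
        # 3. Generate audio from the music prompt.
        audio = self.generator(music_prompt)
        return music_prompt, audio


# Stand-in stages so the sketch runs without any pre-trained models.
def fake_captioner(path: str) -> str:
    return f"a quiet beach at sunset ({path})"


def fake_llm(prompt: str) -> str:
    return " Calm ambient piano, slow tempo \n"


def fake_generator(prompt: str) -> bytes:
    return prompt.encode("utf-8")


pipeline = Pipeline(fake_captioner, fake_llm, fake_generator)
music_prompt, audio = pipeline.run("beach.jpg")
print(music_prompt)
```

Because every stage is an ordinary callable and the only data passed between stages is text, each pre-trained model can be swapped independently and the intermediate prompts can be inspected directly, which is the kind of transparency the summary attributes to the framework.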

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

ByteComposer: a Human-like Melody Composition Method based on Language Model Agent

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

Deep Cross-Modal Audio-Visual Generation

Simple and Controllable Music Generation

Melody-Guided Music Generation

PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network

Art2Mus: Bridging Visual Arts and Music through Cross-Modal Generation

Multi-Modal Experience Inspired AI Creation

Multi-Source Music Generation with Latent Diffusion

Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Local deployment of large-scale music AI models on commodity hardware

Multi-Track Music Generation Network Based on a Hybrid Learning Module

MIDI-Sandwich: Multi-model Multi-task Hierarchical Conditional VAE-GAN networks for Symbolic Single-track Music Generation

Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls

Music Generation Using Dual Interactive Wasserstein Fourier Acquisitive Generative Adversarial Network

Content-based Controls For Music Large Language Modeling