Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Tianze Xu, Jiajun Li, Xuesong Chen, Xinrui Yao, Shuchang Liu
2024-05-07
Abstract: In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose a multi-modal music generation framework, Mozart's Touch. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problems between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and the results indicate that our model surpasses the performance of current state-of-the-art models. Our code and examples are available at:
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to address the limitations of multimodal music generation models in converting images or videos into music, particularly in capturing the emotional atmosphere of visual inputs. To tackle this issue, the authors propose a lightweight multimodal music generation framework named "Mozart’s Touch." Mozart’s Touch consists of three main modules:

1. **Multi-modal Captioning Module**: Encodes and interprets user-supplied images and videos, and generates descriptive text.
2. **LLM Understanding & Bridging Module**: Uses large language models (LLMs) to convert the multimodal descriptive text into the prompts required for music generation, so that the generated music better reflects the emotions and themes of the visual input.
3. **Music Generation Module**: Generates music with the pre-trained MusicGen model.

Working together, these three modules let Mozart’s Touch efficiently generate music that matches the input visual content without retraining or fine-tuning the pre-trained models. The framework also introduces the "LLM-Bridge" method to address the heterogeneous representation issues between descriptive texts of different modalities, thereby improving the quality and relevance of the generated music.

Experimental results show that Mozart’s Touch outperforms existing state-of-the-art models in both image-to-music and video-to-music generation tasks. The reported gains are most notable in the relevance and consistency of the generated music.
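
The three-module flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Pipeline` class, the `BRIDGE_TEMPLATE` wording, and the stub functions are all hypothetical names introduced here, and the captioner, LLM, and music generator are passed in as plain callables so that any pre-trained model could be plugged in without training. The stubs only stand in for real models so the data flow can be followed end to end.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (assumptions, not taken from the paper):
Captioner = Callable[[str], str]         # media path -> descriptive caption
LLM = Callable[[str], str]               # instruction prompt -> completion text
MusicGenerator = Callable[[str], bytes]  # music prompt -> encoded audio

# Assumed bridging instruction; the paper's actual prompt may differ.
BRIDGE_TEMPLATE = (
    "Convert the following visual description into a short prompt for a "
    "text-to-music model. Describe mood, genre, tempo and instruments only.\n"
    "Description: {caption}\n"
    "Music prompt:"
)


@dataclass
class Pipeline:
    captioner: Captioner
    llm: LLM
    generator: MusicGenerator

    def build_bridge_prompt(self, captions: List[str]) -> str:
        # Merge per-image or per-frame captions into one description.
        joined = " ".join(c.strip() for c in captions if c.strip())
        return BRIDGE_TEMPLATE.format(caption=joined)

    def run(self, media_paths: List[str]) -> bytes:
        # 1. Captioning: visual input -> descriptive text.
        captions = [self.captioner(path) for path in media_paths]
        # 2. Bridging: descriptive text -> music-oriented prompt via the LLM.
        music_prompt = self.llm(self.build_bridge_prompt(captions)).strip()
        # 3. Generation: music prompt -> audio from a frozen generator.
        return self.generator(music_prompt)


# Stub models used only to show the data flow; real models would replace them.
def fake_captioner(path: str) -> str:
    return f"a sunset over a quiet sea ({path})"


def fake_llm(prompt: str) -> str:
    return " calm ambient piano, slow tempo "


def fake_generator(prompt: str) -> bytes:
    return prompt.encode("utf-8")


pipeline = Pipeline(fake_captioner, fake_llm, fake_generator)
audio = pipeline.run(["frame_0.png", "frame_1.png"])
```

Because every stage communicates through plain text, the intermediate caption and music prompt can be logged and inspected, which is the kind of prompt-level transparency the summary attributes to the framework.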