MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

K R Prajwal, Bowen Shi, Matthew Lee, Apoorv Vyas, Andros Tjandra, Mahi Luthra, Baishan Guo, Huiyu Wang, Triantafyllos Afouras, David Kant, Wei-Ning Hsu
2024-10-27
Abstract: We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audios, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over $2\sim5$ times smaller and requiring $5$ times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to develop an efficient, multi-functional text-guided music generation model. The authors propose a method named MusicFlow to address three shortcomings of existing music generation models:

1. **Computational Efficiency and Model Size**: Existing methods based on language models or diffusion models usually require many parameters and substantial compute, which leads to large models and slow inference.
2. **Multi-task Generation Ability**: Many existing models focus on generating music from text (text-to-music, TTM) but cannot handle other practical generation tasks, such as music continuation and infilling.
3. **Capturing the Complex Dependencies between Text Descriptions and Music**: Methods that generate music audio directly from text often struggle to capture the complex dependencies between text descriptions and music segments.

### Main Contributions of MusicFlow

- **Cascaded Flow-Matching Networks**: MusicFlow uses two cascaded flow-matching networks, one modeling semantic features and the other modeling acoustic features, with self-supervised representations bridging text descriptions and audio. This lets the model learn and generate music efficiently while keeping the model relatively small.
- **Non-Autoregressive Training Objectives**: Using masked prediction as the training objective lets the model perform other music generation tasks, such as music continuation and infilling, in a zero-shot manner.
- **Efficient Inference Process**: Compared with traditional autoregressive models and diffusion models, MusicFlow requires fewer iterative steps at inference time, which speeds up generation.
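
The masked-prediction idea in the contributions above can be illustrated with a small sketch. It assumes a per-frame mask in which `True` marks the frames the model must generate. The helper names `make_mask` and `masked_context`, the task labels, and the zero-fill convention are hypothetical and are not taken from the paper; the point is only that one interface can cover text-to-music, continuation, and infilling by changing which frames stay visible.

```python
from typing import List, Sequence


def make_mask(num_frames: int, task: str, start: int = 0, end: int = 0) -> List[bool]:
    """Return a per-frame mask; True marks frames the model must generate.

    Hypothetical convention (not from the paper):
      - "text_to_music": every frame is generated.
      - "continuation":  frames [0, start) are the given prompt.
      - "infilling":     frames [start, end) are the gap to fill.
    """
    if task == "text_to_music":
        return [True] * num_frames
    if task == "continuation":
        return [i >= start for i in range(num_frames)]
    if task == "infilling":
        return [start <= i < end for i in range(num_frames)]
    raise ValueError(f"unknown task: {task}")


def masked_context(features: Sequence[Sequence[float]],
                   mask: Sequence[bool]) -> List[List[float]]:
    """Zero out masked frames so only visible frames act as conditioning."""
    return [[0.0] * len(frame) if hidden else list(frame)
            for frame, hidden in zip(features, mask)]


# Example: a 5-frame clip where frames 1 and 2 are missing.
mask = make_mask(5, "infilling", start=1, end=3)
frames = [[float(i), float(i) + 0.5] for i in range(5)]
context = masked_context(frames, mask)
print(mask)
print(context)
```

Under this convention, switching tasks at inference time only changes the mask, not the model, which is one plausible reading of how a masked training objective enables zero-shot infilling and continuation.
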
### Experimental Results

Experiments on MusicCaps show that MusicFlow matches or exceeds existing methods in generation quality and text coherence while being roughly 2 to 5 times smaller and requiring 5 times fewer iterative steps. MusicFlow also achieves competitive performance on music continuation and infilling, which demonstrates its flexibility across generation tasks.

### Formula Representation

The main formulas in the paper are as follows:

- Ordinary Differential Equation (ODE) for Flow Matching:

\[ \frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)), \qquad \phi_0(x) = x \]

where \( v: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d \) is a time-dependent vector field parameterized by a neural network, and \( \phi_t \) is the flow it generates.

- Conditional Flow-Matching Objective Function:

\[ \mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x \mid x_1)} \left[ \|v_t(x; \theta) - u_t(x \mid x_1)\|^2 \right] \]

where \( t \) is sampled uniformly from \( [0,1] \), \( x_1 \) is a data sample drawn from \( q \), \( p_t(x \mid x_1) \) is the conditional probability path, and \( u_t(x \mid x_1) \) is the conditional vector field that generates it.

Together, these formulas define a continuous transformation from a simple prior distribution to the data distribution: the objective trains the vector field by regression, and solving the ODE at inference time generates samples.
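
To make the two formulas concrete, the following is a minimal sketch in plain Python. It assumes a straight-line conditional path \( x_t = (1 - t)\,x_0 + t\,x_1 \), for which the regression target is \( x_1 - x_0 \). This path choice, the Euler solver, and all function names are illustrative assumptions and are not taken from the paper. A closed-form "oracle" vector field stands in for the trained network so the example runs without any training.

```python
from typing import Callable, List

Vector = List[float]
VectorField = Callable[[float, Vector], Vector]


def point_on_path(x0: Vector, x1: Vector, t: float) -> Vector:
    """Assumed straight-line conditional path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target u_t(x_t | x1) = x1 - x0 for the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(v_theta: VectorField, x0: Vector, x1: Vector, t: float) -> float:
    """Single-sample squared error ||v_t(x_t; theta) - u_t(x_t | x1)||^2."""
    xt = point_on_path(x0, x1, t)
    pred = v_theta(t, xt)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target))


def euler_solve(v: VectorField, x0: Vector, num_steps: int) -> Vector:
    """Integrate d/dt phi_t(x) = v_t(phi_t(x)) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # the field is never evaluated at t = 1
        velocity = v(t, x)
        x = [xi + h * vi for xi, vi in zip(x, velocity)]
    return x


def make_oracle_field(x1: Vector) -> VectorField:
    """Closed-form field (x1 - x) / (1 - t); a stand-in for a trained network."""
    def field(t: float, x: Vector) -> Vector:
        return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]
    return field


x0 = [0.5, 0.25]   # sample from the prior (fixed here for reproducibility)
x1 = [2.0, -1.0]   # the "data" point
oracle = make_oracle_field(x1)
print("loss at t=0.3:", cfm_loss(oracle, x0, x1, 0.3))
print("sample after 10 Euler steps:", euler_solve(oracle, x0, 10))
```

In training, the loss would be averaged over random \( t \), data samples \( x_1 \), and prior samples \( x_0 \). At inference, the number of Euler steps plays the role of the iterative steps mentioned in the abstract, so fewer steps mean faster generation.
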