Abstract: We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audio, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over $2\sim5$ times smaller and requiring $5$ times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.

What problem does this paper attempt to address?

This paper aims to develop an efficient, multi-purpose text-guided music generation model. Specifically, the authors propose a method named MusicFlow to address the following shortcomings of existing music generation models:

1. **Computational Efficiency and Model Size**: Existing methods based on language models or diffusion models usually require many parameters and substantial compute, resulting in large models and slow inference.
2. **Multi-task Generation Ability**: Many existing models focus on text-to-music (TTM) generation but cannot handle other practical generation tasks, such as music continuation and infilling.
3. **Capturing the Complex Dependencies between Text Descriptions and Music**: Methods that generate music audio directly from text often struggle to capture the complex dependencies between text descriptions and music segments.

### Main Contributions of MusicFlow

- **Cascaded Flow-Matching Networks**: MusicFlow uses two cascaded flow-matching networks, one modeling semantic features and the other modeling acoustic features. Self-supervised representations bridge the text descriptions and the music audio, which lets the model stay relatively small.
- **Non-Autoregressive Training Objective**: Masked prediction is used as the training objective, which lets the model perform other music generation tasks, such as music continuation and infilling, in a zero-shot manner.
- **Efficient Inference Process**: MusicFlow needs fewer iterative steps at inference than the models it is compared against (5 times fewer, according to the abstract), which speeds up generation.

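As a rough illustration of how one masked-prediction setup can cover both infilling and continuation, the sketch below builds frame-level masks and scores a prediction on the masked frames only. The helper names (`span_mask`, `continuation_mask`, `hide_masked`, `masked_mse`), the zero fill value, and the choice to compute the loss only on masked frames are assumptions made for this example; the summary does not describe MusicFlow's actual masking scheme.

```python
def span_mask(num_frames, start, end):
    """Return a 0/1 list: 1 = frame to generate, 0 = frame kept as context."""
    if not 0 <= start <= end <= num_frames:
        raise ValueError("need 0 <= start <= end <= num_frames")
    return [1 if start <= i < end else 0 for i in range(num_frames)]


def continuation_mask(num_frames, prompt_len):
    """Continuation: keep the first prompt_len frames, generate the rest."""
    return span_mask(num_frames, prompt_len, num_frames)


def hide_masked(frames, mask, fill=0.0):
    """Replace masked frames with a fill value so only the context is visible."""
    return [[fill] * len(f) if m else list(f) for f, m in zip(frames, mask)]


def masked_mse(pred, target, mask):
    """Mean squared error computed over masked frames only (assumed loss)."""
    errors = [
        (p - t) ** 2
        for pf, tf, m in zip(pred, target, mask) if m
        for p, t in zip(pf, tf)
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Toy sequence: 6 frames with 2 feature dimensions each.
frames = [[float(i), float(-i)] for i in range(6)]
infill = span_mask(len(frames), 2, 4)      # infilling: regenerate frames 2 and 3
cont = continuation_mask(len(frames), 2)   # continuation: regenerate frames 2..5
context = hide_masked(frames, infill)      # what a model would see as conditioning
```

In this framing, infilling and continuation differ only in where the masked span sits, which is one plausible reading of why a single masked training objective can transfer to both tasks without task-specific training.
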
### Experimental Results

Experiments on MusicCaps show that MusicFlow matches or exceeds existing methods in generation quality and text coherence while being 2 to 5 times smaller and requiring 5 times fewer iterative steps. MusicFlow also achieves competitive performance on music continuation and infilling, demonstrating its flexibility across generation tasks.

### Formula Representation

The main formulas in the paper are:

- Ordinary differential equation (ODE) for flow matching:

  \[ \frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)) \]

  where $v: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ is a time-dependent vector field parameterized by a neural network, and $\phi_t$ is the flow it generates.

- Conditional flow-matching objective:

  \[ \mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x \mid x_1)} \left[ \|v_t(x; \theta) - u_t(x \mid x_1)\|^2 \right] \]

  where $t$ is sampled uniformly from $[0,1]$, $x_1$ is a data sample drawn from $q$, $p_t(x \mid x_1)$ is the conditional probability path, and $u_t(x \mid x_1)$ is the conditional vector field that generates it.

Together, these formulas define a continuous transformation of probability densities: the objective trains $v_t$ by regression, and solving the ODE at inference turns samples from a simple prior into generated features.
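The ODE and the conditional flow-matching objective above can be sketched in a few lines of plain Python. This is a toy illustration, not MusicFlow's implementation: it assumes the straight-line conditional path $x_t = (1 - (1 - \sigma_{\min})t)\,x_0 + t\,x_1$ with target velocity $x_1 - (1 - \sigma_{\min})\,x_0$, a fixed-step Euler solver, and a one-point "dataset" for which the exact vector field is known in closed form. All function names are hypothetical.

```python
import random


def sample_conditional_path(x0, x1, t, sigma_min=0.0):
    """Point x_t on the assumed straight-line path and its target velocity u_t."""
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(v_theta, x1_batch, rng, sigma_min=0.0):
    """Monte Carlo estimate of E ||v_t(x; theta) - u_t(x | x_1)||^2."""
    total = 0.0
    for x1 in x1_batch:
        t = rng.random()                          # t ~ U[0, 1)
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # noise sample from the prior
        x_t, u_t = sample_conditional_path(x0, x1, t, sigma_min)
        v = v_theta(t, x_t)
        total += sum((vi - ui) ** 2 for vi, ui in zip(v, u_t))
    return total / len(x1_batch)


def euler_sample(v, x0, num_steps):
    """Integrate dx/dt = v(t, x) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        vel = v(t, x)
        x = [xi + dt * vi for xi, vi in zip(x, vel)]
    return x


def exact_field(target):
    """Closed-form field when all data equals `target` and sigma_min = 0."""
    def v(t, x):
        return [(c - xi) / (1.0 - t) for c, xi in zip(target, x)]
    return v


rng = random.Random(0)
target = [2.0, -1.0]
data = [target] * 64                              # toy dataset: a single point
exact = exact_field(target)
loss_exact = cfm_loss(exact, data, rng)           # close to zero
loss_zero = cfm_loss(lambda t, x: [0.0] * len(x), data, rng)
sample = euler_sample(exact, [rng.gauss(0.0, 1.0) for _ in target], num_steps=8)
```

In a real system `v_theta` would be a neural network trained by minimizing `cfm_loss`, and the number of Euler steps would set the inference cost, which is where a reduction in iterative steps would show up.
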

MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
