Abstract: Diffusion models have shown promising results in cross-modal generation tasks involving audio and music, such as text-to-sound and text-to-music generation. These text-controlled music generation models typically focus on generating music by capturing global musical attributes like genre and mood. However, music composition is a complex, multilayered task that often involves musical arrangement as an integral part of the process. This process involves composing each instrument to align with existing ones in terms of beat, dynamics, harmony, and melody, requiring greater precision and control over tracks than text prompts usually provide. In this work, we address these challenges by extending the MusicLDM, a latent diffusion model for music, into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model is capable of generating music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, where the model can generate any subset of tracks given the others (e.g., generating a piano track complementing given bass and drum tracks). We compared our model with an existing multi-track generative model and demonstrated that our model achieves considerable improvements across objective metrics for both total and arrangement generation tasks.
What problem does this paper attempt to address?
The paper addresses the shortcomings of existing music generation models in two areas: multi-track music generation and arrangement generation. Specifically:
1. **Multi-track music generation**: Most existing music generation models are driven by text prompts. They capture global properties of the music, such as genre and mood, but offer little precise control over musical detail, in particular over how tracks correspond when several instruments play at once. Music composition is a complex, multilayered task in which the instrument tracks must agree in beat, dynamics, harmony, and melody. A model is therefore needed that keeps multiple tracks consistent with one another as it generates them.
2. **Arrangement generation**: Arrangement means generating the missing tracks given some existing ones, so that together they form a complete piece. The model must generate new tracks that fit the existing tracks in both time and frequency. Existing models are limited in this respect and struggle to produce high-quality arrangements.
To address these challenges, the paper proposes Multi-Track MusicLDM, a multi-track music generation model built on a latent diffusion model (LDM). By learning the joint probability distribution of tracks that share a context, the model keeps the tracks consistent with one another during generation. It can also perform arrangement generation, that is, generate any subset of tracks given the others.
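The summary does not spell out how the given tracks are held fixed during sampling. One common way to generate a subset of tracks with a joint diffusion model is inpainting-style sampling: at every reverse step the known tracks are overwritten with a noised copy of the given audio, and only the remaining tracks are left free to evolve. The sketch below illustrates that idea on toy data. Everything in it is an assumption made for illustration, not the paper's implementation: `toy_denoise_step` is a hypothetical stand-in for a trained denoiser, the linear noise schedule in `noised` is invented, and tracks are plain lists of floats rather than latents.

```python
import math
import random


def toy_denoise_step(x, t):
    """Hypothetical stand-in for a trained joint denoiser (NOT the paper's network).

    It pulls every track toward the per-position mean across tracks, a crude
    proxy for "tracks should agree with one another".
    """
    n_tracks = len(x)
    length = len(x[0])
    mean = [sum(x[k][i] for k in range(n_tracks)) / n_tracks for i in range(length)]
    w = 1.0 / t  # pull gets stronger as t approaches 1
    return [[(1.0 - w) * x[k][i] + w * mean[i] for i in range(length)]
            for k in range(n_tracks)]


def noised(clean, t, num_steps, rng):
    """Noise a clean track to level t with a simple linear schedule (an assumption)."""
    keep = 1.0 - t / num_steps            # t == num_steps gives pure noise
    sigma = math.sqrt(1.0 - keep * keep)
    return [keep * v + sigma * rng.gauss(0.0, 1.0) for v in clean]


def generate_arrangement(given, n_tracks, length, num_steps=20, seed=0):
    """Inpainting-style sampling: `given` maps track index -> fixed track values."""
    for k, clean in given.items():
        if not 0 <= k < n_tracks or len(clean) != length:
            raise ValueError("given tracks must match n_tracks and length")
    rng = random.Random(seed)
    # Start every track from Gaussian noise.
    x = [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(n_tracks)]
    for t in range(num_steps, 0, -1):
        # Re-impose the known tracks at the current noise level.
        for k, clean in given.items():
            x[k] = noised(clean, t, num_steps, rng)
        # One joint reverse step over all tracks.
        x = toy_denoise_step(x, t)
    # The given tracks are returned unchanged; the others were generated.
    for k, clean in given.items():
        x[k] = list(clean)
    return x


# Example: tracks 0 and 1 (say, bass and drums) are given; tracks 2 and 3 are generated.
given_tracks = {0: [1.0] * 8, 1: [1.0] * 8}
arrangement = generate_arrangement(given_tracks, n_tracks=4, length=8)
```

Because the known tracks are re-imposed before every step, the free tracks are always denoised in the context of the given ones, which is what lets a model of the joint distribution double as a conditional generator.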
### Main contributions
- **Multi-track generation**: The model keeps the tracks consistent with one another while generating them together, producing high-quality multi-track music.
- **Arrangement generation**: The model can generate the remaining tracks given some existing ones, which amounts to generating a musical arrangement.
- **Conditional generation**: The model supports both text and audio conditioning, improving the controllability and diversity of the generated music.
### Experimental results
- **Total generation task**: In unconditional mode, the model reaches a Fréchet Audio Distance (FAD) of 1.36 on the Slakh test set, compared with 6.55 for the baseline model MSDM (lower is better).
- **Audio-conditional generation**: When conditioned through the CLAP audio branch, the quality of the generated music improves further, with the FAD decreasing from 1.36 to 1.13.
- **Text-conditional generation**: When conditioned through the CLAP text encoder, the model generates music with different perceptual qualities for different text prompts, which demonstrates its ability to generate under text conditioning.
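For readers unfamiliar with the metric, FAD is conventionally defined as the Fréchet distance between two Gaussians fitted to embeddings of reference audio and generated audio. The summary does not say which embedding model was used, so the formula below is the standard definition rather than a description of the paper's exact evaluation setup:

$$
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

Here $\mu_r, \Sigma_r$ are the mean and covariance of the reference embeddings and $\mu_g, \Sigma_g$ those of the generated embeddings. The distance is zero when the two Gaussians coincide, which is why lower scores indicate generated audio that is statistically closer to the reference set.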
In conclusion, by introducing Multi-Track MusicLDM, the paper addresses key problems in multi-track music generation and arrangement generation, and reports considerable improvements over the existing baseline on objective metrics for both tasks.