Mustango: Toward Controllable Text-to-Music Generation

Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, Soujanya Poria

2024-06-03

Abstract: The quality of text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music, not only with general text captions, but with richer captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that alters the harmonic, rhythmic, and dynamic aspects of music audio and uses state-of-the-art Music Information Retrieval methods to extract music features, which are then appended to the existing descriptions in text format. We release the resulting MusicBench dataset, which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and the controllability through music-specific text prompts greatly outperforms other models such as MusicGen and AudioLDM2.

Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses two core issues:

1. **Improving the controllability of music generation**: Existing text-to-music models have made significant progress in the quality of generated music, but control over specific musical aspects (such as chords, beats, tempo, and key) has barely been explored. The researchers therefore propose Mustango, a diffusion-based system that controls the generated music through richer text captions that can include specific musical instructions.
2. **Addressing the scarcity of training data**: Open music datasets with text captions are scarce, which limits the development of text-to-music models. To address this, the researchers built MusicBench, a dataset of over 52K instances created by augmenting existing data: the harmonic, rhythmic, and dynamic aspects of the audio are altered, and music features extracted with Music Information Retrieval methods are appended to the captions as text.

In short, the goal of the paper is higher-quality and more controllable text-to-music generation, achieved through the Mustango system and its accompanying MusicBench dataset.
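The augmentation idea summarized above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the linear gain ramp used as a "dynamic" alteration, the feature dictionary keys, and the sentence templates are all hypothetical, and the feature values are hard-coded here where the paper would obtain them from Music Information Retrieval tools.

```python
import numpy as np


def apply_gain_ramp(audio, start_gain=0.3, end_gain=1.0):
    """Scale a waveform by a linear gain ramp (a simple dynamic alteration).

    Hypothetical helper: the paper alters dynamics, but this exact
    transformation is an assumption made for illustration.
    """
    ramp = np.linspace(start_gain, end_gain, num=audio.shape[-1])
    return audio * ramp


def append_music_features(caption, features):
    """Append music features to a caption as plain-text sentences.

    `features` is a hypothetical dict; the keys and sentence templates
    below are assumptions, not the MusicBench caption format.
    """
    parts = [caption.rstrip()]
    if "tempo_bpm" in features:
        parts.append(f"The tempo is {features['tempo_bpm']} bpm.")
    if "key" in features:
        parts.append(f"The key is {features['key']}.")
    if "chords" in features:
        chords = ", ".join(features["chords"])
        parts.append(f"The chord progression is {chords}.")
    return " ".join(parts)


# Usage: one second of a 440 Hz tone at 16 kHz, faded in, with an enriched caption.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
louder_over_time = apply_gain_ramp(tone)

enriched = append_music_features(
    "A calm piano piece.",
    {"tempo_bpm": 90, "key": "C major", "chords": ["C", "G", "Am", "F"]},
)
print(enriched)
```

Keeping the control information in the caption text itself, as the abstract describes, means a single text prompt can carry both the general description and the music-specific instructions.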

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

Musecoco: Generating symbolic music from text

Melody-Guided Music Generation

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Music ControlNet: Multiple Time-varying Controls for Music Generation

Text2midi: Generating Symbolic Music from Captions

MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

Noise2Music: Text-conditioned Music Generation with Diffusion Models

MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit

Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Content-based Controls For Music Large Language Modeling

Improving Controllability and Editability for Pretrained Text-to-Music Generation Models

MusicLM: Generating Music From Text

UniMuMo: Unified Text, Music and Motion Generation

Simple and Controllable Music Generation

JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models