Simple and Controllable Music Generation

Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez
2024-01-30
Abstract: We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better control over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The main goal of this paper is to propose MusicGen, a simple and controllable music generation model that produces high-quality music from text descriptions. Specifically, the paper addresses the following key issues:

1. **Multi-stream audio representation**: Existing approaches to modeling several parallel streams of audio tokens rely on multiple cascaded models (e.g., hierarchical or upsampling stages). MusicGen eliminates this need with a single-stage transformer language model and an efficient codebook interleaving scheme.
2. **High-fidelity audio generation**: MusicGen generates music samples at a 32 kHz sampling rate while remaining faithful to the text description.
3. **Controllable generation**: The paper introduces an unsupervised melody conditioning mechanism that lets the generated music follow a given harmonic and melodic structure, making the generation process more controllable.
4. **Stereo audio support**: By extending the codebook interleaving scheme, MusicGen can generate stereo music at a lower computational cost.
5. **Comprehensive evaluation**: Through extensive automatic and human evaluations, the paper shows that MusicGen outperforms the evaluated baselines on a standard text-to-music benchmark, and its ablation studies reveal the importance of each component.
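To make the idea of codebook interleaving more concrete, the following is a minimal sketch of one possible pattern, a "delay"-style layout in which codebook k is shifted right by k steps so that a single autoregressive model can predict one token per codebook at every step. The summary above does not spell out the exact pattern, so treat this as an illustration under that assumption rather than the authors' implementation. The names `delay_interleave`, `undo_delay`, and `PAD` are hypothetical and do not come from the paper or its codebase.

```python
from typing import List, Optional

# Hypothetical placeholder meaning "no token at this step" (an assumption,
# not a value taken from the paper or from audiocraft).
PAD: Optional[int] = None


def delay_interleave(codes: List[List[int]]) -> List[List[Optional[int]]]:
    """Shift codebook k right by k steps.

    codes[k][t] is the token of codebook k at frame t (K codebooks, T frames).
    The result has K rows of length T + K - 1, so all K streams can be
    decoded in T + K - 1 steps by one model instead of a cascade.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    out = [[PAD] * (num_frames + num_codebooks - 1) for _ in range(num_codebooks)]
    for k, row in enumerate(codes):
        for t, token in enumerate(row):
            out[k][t + k] = token
    return out


def undo_delay(delayed: List[List[Optional[int]]]) -> List[List[Optional[int]]]:
    """Invert delay_interleave by slicing each row back to its original frames."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [delayed[k][k:k + num_frames] for k in range(num_codebooks)]


# Toy example: 3 codebooks, 4 frames.
codes = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
]
delayed = delay_interleave(codes)
for row in delayed:
    print(row)
```

In this layout, the token of codebook k for a given frame is emitted k steps after the token of codebook 0 for that frame, so at prediction time the model has already seen the lower-numbered codebooks of the same frame. A fully flattened pattern would need K × T steps instead of T + K - 1, which is the kind of cost trade-off an interleaving scheme has to balance.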