Simple and Controllable Music Generation

Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez

2024-01-30

Abstract: We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft

Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing

What problem does this paper attempt to address?

The main goal of this paper is to propose MusicGen, a simple and controllable music generation model that produces high-quality music from text descriptions. Specifically, the paper addresses the following key issues:

1. **Multi-stream audio representation**: Existing approaches to modeling multi-stream audio representations cascade several models (e.g., hierarchically or through upsampling). MusicGen eliminates this need with a single-stage transformer language model and an efficient codebook interleaving scheme.
2. **High-fidelity audio generation**: MusicGen generates music samples at a 32 kHz sampling rate while staying faithful to the text descriptions.
3. **Controllable generation**: The paper introduces an unsupervised melody conditioning mechanism that lets the generated music match a given harmonic and melodic structure, which improves control over the output.
4. **Stereo audio support**: By extending the codebook interleaving scheme, MusicGen can generate stereo music at a lower computational cost.
5. **Comprehensive evaluation**: Through extensive automatic and human evaluations, the paper shows that MusicGen outperforms the evaluated baseline models on a standard text-to-music benchmark. Detailed ablation studies reveal the importance of each component.
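The codebook interleaving scheme mentioned above can be illustrated with a small sketch. It assumes a "delay"-style pattern, in which codebook k is shifted right by k steps so that one model can emit a token for every codebook at each decoding step. The function names, the `PAD` placeholder, and the toy token values are hypothetical and are not taken from the audiocraft code base.

```python
from typing import List, Optional

PAD = None  # hypothetical placeholder for "no token at this position"


def delay_interleave(codes: List[List[int]]) -> List[List[Optional[int]]]:
    """Shift codebook k right by k steps.

    codes[k][t] is the token of codebook k at audio frame t (K rows, T columns).
    The result has K rows of length T + K - 1; column s holds the tokens that
    would be predicted together at decoding step s.
    """
    K = len(codes)
    T = len(codes[0])
    out: List[List[Optional[int]]] = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out


def undo_delay(delayed: List[List[Optional[int]]], T: int) -> List[List[Optional[int]]]:
    """Invert delay_interleave, recovering the original K x T token grid."""
    return [[row[t + k] for t in range(T)] for k, row in enumerate(delayed)]


def flatten_steps(K: int, T: int) -> int:
    """Decoding steps if every token of every codebook gets its own step."""
    return K * T


def delay_steps(K: int, T: int) -> int:
    """Decoding steps under the delay pattern sketched above."""
    return T + K - 1


if __name__ == "__main__":
    # Toy example: 4 codebooks, 3 frames (values are arbitrary).
    codes = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
    for row in delay_interleave(codes):
        print(row)
    # Illustrative sequence length only; T = 1500 is an assumed frame count.
    print(flatten_steps(4, 1500), delay_steps(4, 1500))
```

The step counts show why such a pattern matters: fully flattening K codebooks multiplies the sequence length by K, while the delayed layout adds only K - 1 steps, at the price of predicting several codebooks at once.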
