MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang

2024-07-21

Abstract: Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal, or be a user-defined symbolic chord sequence, BPM, and textual prompts. Our performance evaluation on two datasets -- one derived from extracted features and the other from user-created inputs -- demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online at https://musicongen.github.io/musicongen_demo/.

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses a limitation of text-to-music generation models: although existing models produce high-quality, diverse audio, text prompts alone cannot precisely control temporal features of the generated music such as chords and rhythm. To tackle this, the authors propose MusiConGen, a temporally-conditioned, Transformer-based text-to-music model built on the pretrained MusicGen framework. Its main innovation is an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically extracted rhythm and chords as the condition signal. During inference, the condition can be either musical features extracted from a reference audio signal or a user-defined symbolic chord sequence, BPM (beats per minute), and text prompt. Evaluations on two datasets, one derived from extracted features and the other from user-created inputs, show that MusiConGen generates realistic backing-track music that aligns well with the specified conditions. The authors also open-source the code and model checkpoints and provide audio examples online.
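To make the user-defined inference inputs concrete, the sketch below shows one way a symbolic chord sequence and a BPM value could be expanded into frame-level chord labels and beat positions. It is an illustration only, not the authors' implementation: the function names `chords_to_frames` and `beat_frames`, the `(chord, beats)` input format, and the default frame rate of 50 frames per second are all assumptions made for this example.

```python
from typing import List, Tuple


def chords_to_frames(
    chords: List[Tuple[str, int]],
    bpm: float,
    frame_rate: float = 50.0,  # assumed frame rate, not taken from the paper
) -> List[str]:
    """Expand (chord symbol, duration in beats) pairs into one label per frame."""
    if bpm <= 0 or frame_rate <= 0:
        raise ValueError("bpm and frame_rate must be positive")
    seconds_per_beat = 60.0 / bpm
    frames: List[str] = []
    elapsed_beats = 0
    for symbol, beats in chords:
        elapsed_beats += beats
        # Round the cumulative end time so rounding errors do not accumulate.
        end_frame = round(elapsed_beats * seconds_per_beat * frame_rate)
        frames.extend([symbol] * (end_frame - len(frames)))
    return frames


def beat_frames(n_beats: int, bpm: float, frame_rate: float = 50.0) -> List[int]:
    """Return the frame index at which each of the first n_beats beats falls."""
    if bpm <= 0 or frame_rate <= 0:
        raise ValueError("bpm and frame_rate must be positive")
    seconds_per_beat = 60.0 / bpm
    return [round(i * seconds_per_beat * frame_rate) for i in range(n_beats)]


# Hypothetical usage: two bars in 4/4 at 120 BPM.
progression = [("C", 4), ("G", 4)]
chord_track = chords_to_frames(progression, bpm=120)
beats = beat_frames(8, bpm=120)
```

At 120 BPM each beat lasts 0.5 seconds, so under the assumed frame rate each four-beat chord spans 100 frames and beats fall every 25 frames. A real system would still need to encode these labels into whatever representation the model consumes.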

Simple and Controllable Music Generation

MusicGen-Chord: Advancing Music Generation through Chord Progressions and Interactive Web-UI

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Musecoco: Generating symbolic music from text

Mustango: Toward Controllable Text-to-Music Generation

JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models

Music ControlNet: Multiple Time-varying Controls for Music Generation

PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

MusicLM: Generating Music From Text

MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit

JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation

BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features

Audio Conditioning for Music Generation via Discrete Bottleneck Features

Text2midi: Generating Symbolic Music from Captions

Content-based Controls For Music Large Language Modeling

StemGen: A music generation model that listens