MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang
2024-07-21
Abstract: Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal, or be a user-defined symbolic chord sequence, BPM, and textual prompts. Our performance evaluation on two datasets (one derived from extracted features and the other from user-created inputs) demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online at https://musicongen.github.io/musicongen_demo/.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses a limitation of text-to-music generation models: existing models can produce high-quality, diverse audio, but text prompts alone cannot precisely control temporal features of the generated music such as chords and rhythm. To tackle this, the authors propose MusiConGen, a temporally conditioned, Transformer-based text-to-music model built on the pretrained MusicGen framework. Its main innovation is an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically extracted rhythm and chords as the condition signal. During inference, the condition can come either from musical features extracted from a reference audio signal or from a user-defined symbolic chord sequence, BPM (beats per minute), and text prompt. Evaluations on two datasets, one derived from extracted features and the other from user-created inputs, show that MusiConGen can generate realistic backing-track music that aligns well with the specified conditions. The authors have also open-sourced the code and model checkpoints and provide audio examples online.
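To make the idea of a user-defined symbolic condition more concrete, the following is a minimal sketch of how a chord sequence and a BPM value could be expanded into frame-level signals. It is not taken from the MusiConGen code: the function name `frame_conditions`, the 50 Hz frame rate, the fixed number of beats per chord, and the binary beat representation are all assumptions made for illustration, and the actual model may encode its conditions differently.

```python
from typing import List, Tuple


def frame_conditions(
    chords: List[str],
    bpm: float,
    frame_rate: float = 50.0,   # assumed frames per second (hypothetical)
    beats_per_chord: int = 4,   # assumed: every chord lasts one 4-beat bar
) -> Tuple[List[str], List[int]]:
    """Expand a chord sequence and BPM into per-frame chord labels and beat flags."""
    total_beats = len(chords) * beats_per_chord
    frames_per_beat = 60.0 * frame_rate / bpm
    n_frames = int(round(total_beats * frames_per_beat))

    # Per-frame chord label: find which beat, then which chord, each frame is in.
    chord_frames = []
    for i in range(n_frames):
        beat_idx = int(i / frames_per_beat)
        chord_idx = min(beat_idx // beats_per_chord, len(chords) - 1)
        chord_frames.append(chords[chord_idx])

    # Per-frame beat flag: 1 on the frame closest to each beat onset, else 0.
    beat_frames = [0] * n_frames
    for b in range(total_beats):
        f = int(round(b * frames_per_beat))
        if f < n_frames:
            beat_frames[f] = 1

    return chord_frames, beat_frames


if __name__ == "__main__":
    chord_frames, beat_frames = frame_conditions(["C", "G"], bpm=120)
    print(len(chord_frames), sum(beat_frames))
```

Under these assumptions, two chords at 120 BPM span 8 beats (4 seconds), so both outputs contain 200 frames, and the chord label changes from C to G at the 2-second mark.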