Content-based Controls For Music Large Language Modeling

Liwei Lin, Gus Xia, Junyan Jiang, Yixiao Zhang

2024-10-07

Abstract: Recent years have witnessed rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation using text descriptions. However, the control power of text on music is intrinsically limited, as text can only describe music indirectly through metadata (such as singers and instruments) or high-level representations (such as genre and emotion). We aim to further equip the models with direct, content-based controls on innate music languages such as pitch, chords, and drum track. To this end, we contribute Coco-Mulla, a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieves high-quality music generation with low-resource semi-supervised learning, with trainable parameters amounting to less than 4% of the original model and training on a small dataset of fewer than 300 songs. Moreover, our approach enables effective content-based controls, and we illustrate the control power via chords and rhythms, two of the most salient features of music audio. Furthermore, we show that by combining content-based controls and text descriptions, our system achieves flexible music variation generation and arrangement. Our source code and demos are available online.

Artificial Intelligence, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses controllability in music audio generation, in particular the ability to control complex elements of the musical language (such as chord progressions) and to directly reference musical content from other audio recordings. Existing music generation models rely primarily on text descriptions, which capture music only indirectly through metadata (such as singer or instrument) or high-level representations (such as genre or emotion), and are therefore limited in how much musical information they can express. The paper aims to introduce content-based controls that let the generation model be conditioned directly on fundamental musical elements such as pitch, chords, and drum tracks. To this end, it proposes Coco-Mulla, a content-based control method that uses a parameter-efficient fine-tuning (PEFT) technique tailored for Transformer-based audio models. Experiments show that the method achieves high-quality music generation in a low-resource semi-supervised setting, with trainable parameters amounting to less than 4% of the original model and a training set of fewer than 300 songs, while providing effective content-based control over chords and rhythm. The paper also demonstrates that combining content-based controls with text descriptions enables flexible music variation generation and arrangement.
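To make the "less than 4% of the original model" figure concrete, the following is a minimal sketch of how a trainable-parameter ratio can be computed when a pretrained model is frozen and only a small added module is tuned. The group names and parameter counts are hypothetical placeholders chosen for illustration; they are not taken from the paper and do not describe Coco-Mulla's actual architecture.

```python
from typing import List, Tuple

# Each entry is (name, number_of_parameters, is_trainable).
ParamGroup = Tuple[str, int, bool]


def trainable_ratio(groups: List[ParamGroup]) -> float:
    """Return trainable parameters as a fraction of the frozen (original) model."""
    frozen = sum(n for _, n, trainable in groups if not trainable)
    tuned = sum(n for _, n, trainable in groups if trainable)
    if frozen == 0:
        raise ValueError("no frozen base parameters to compare against")
    return tuned / frozen


# Hypothetical sizes (assumption): a frozen pretrained backbone plus a small
# added module that is the only part updated during fine-tuning.
groups: List[ParamGroup] = [
    ("frozen_pretrained_backbone", 1_000_000_000, False),
    ("added_condition_module", 30_000_000, True),
]

ratio = trainable_ratio(groups)
print(f"trainable parameters: {ratio:.2%} of the original model")
```

With these placeholder numbers the ratio falls under the 4% threshold mentioned in the abstract; the real counts depend on the base model and the fine-tuning configuration.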

Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls

Music ControlNet: Multiple Time-varying Controls for Music Generation

Simple and Controllable Music Generation

Equipping Pretrained Unconditional Music Transformers with Instrument and Genre Controls

BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features

Improving Controllability and Editability for Pretrained Text-to-Music Generation Models

CoCoFormer: A controllable feature-rich polyphonic music generation method

MuseCoco: Generating Symbolic Music from Text

ByteComposer: a Human-like Melody Composition Method based on Language Model Agent

N-Gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding

MuPT: A Generative Symbolic Music Pretrained Transformer

ChatMusician: Understanding and Generating Music Intrinsically with LLM

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss

TunesFormer: Forming Tunes with Control Codes

Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement

Flexible Control in Symbolic Music Generation via Musical Metadata

Mustango: Toward Controllable Text-to-Music Generation