Abstract: Singing is one of the most cherished forms of human entertainment. However, creating a beautiful song requires an accompaniment that complements the vocals and aligns well with the song's instruments and genre. With advancements in deep learning, previous research has focused on generating suitable accompaniments, but the results often lack precise alignment with the desired instrumentation and genre. To address this, we propose a straightforward method that enables control over the accompaniment through text prompts, allowing the generation of music that complements the vocals and meets the song's instrumentation and genre requirements. Through extensive experiments, we successfully generate 10-second accompaniments using vocal input and text control.

What problem does this paper attempt to address?

The main problem this paper addresses is generating musical accompaniment that closely matches user requirements. Although existing music-generation models can produce high-quality accompaniment, they lack precise control over aspects such as instrument selection, rhythm, and style, which makes it difficult to meet users' individual needs. To solve this, the authors propose a new system named Llambada. It controls accompaniment generation through text prompts, so users can define the required accompaniment characteristics more flexibly.

### Main Problems

1. **Lack of Fine-grained Control over Generated Accompaniment**: The accompaniment generated by existing methods rarely meets users' expectations for instruments, rhythm, and style.
2. **Insufficient Datasets**: Datasets used to train music-generation models usually do not contain detailed text descriptions, which limits what the models can learn.

### Solutions

- **Llambada System**: Through text prompts, users can specify details such as the required instruments, rhythm, and style, and the system generates accompaniment that better meets their needs.
- **Pseudo-Caption Dataset Generation Pipeline**: To overcome the shortage of captioned data, the authors propose a pseudo-caption generation method that automatically produces music datasets with detailed descriptions to support model training.

### Technical Details

- **Two-stage Generation Model**:
  - **Semantic Generation Stage**: Generates semantic tokens from the input vocals and the text prompt. These tokens represent the overall structure and rhythm of the music.
  - **Coarse-grained Acoustic Generation Stage**: Generates the final audio waveform from the semantic tokens and the coarse-grained tokens of the vocals.
- **Key Components**:
  - **MERT Model**: Extracts the semantic information of the audio.
  - **CLAP Model**: Converts text prompts into discrete codes that align with audio features.
  - **Encodec Model**: Encodes and decodes audio signals to produce high-quality audio output.

### Experimental Results

In extensive experiments, the Llambada system performs well on multiple evaluation metrics, significantly outperforming existing methods in audio quality and in consistency under text-prompt control. The experiments include in-domain and out-of-domain tests, which verify the generalization ability and robustness of the model.

### Conclusion

By introducing text-prompt control, the Llambada system addresses the lack of fine-grained control over accompaniment generation in existing music-generation models, and it provides new ideas and tools for future research. The authors also point out directions for further improvement, such as optimizing the alignment between text prompts and music semantic features and reducing the demand for computational resources.
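The two-stage design described under Technical Details can be sketched as a simple data flow. The Python below is only an illustration under stated assumptions: every function name, the codebook size `VOCAB`, and the arithmetic used for "conditioning" are hypothetical stand-ins, not the paper's models. The toy tokenizers replace the learned MERT, CLAP, and Encodec components, and the final Encodec decoding step from tokens to a waveform is omitted.

```python
from typing import List

VOCAB = 1024  # hypothetical codebook size, chosen for illustration only


def toy_vocal_tokens(vocals: List[float], frame: int = 4) -> List[int]:
    """Stand-in for a learned audio tokenizer (MERT or Encodec in the summary)."""
    tokens = []
    for start in range(0, len(vocals) - frame + 1, frame):
        mean = sum(vocals[start:start + frame]) / frame
        # Map a mean amplitude in [-1, 1] to an integer id in [0, VOCAB - 1].
        token = int((mean + 1.0) / 2.0 * (VOCAB - 1))
        tokens.append(min(VOCAB - 1, max(0, token)))
    return tokens


def toy_text_codes(prompt: str) -> List[int]:
    """Stand-in for a text encoder that yields discrete codes (CLAP in the summary)."""
    return [sum(ord(ch) for ch in word) % VOCAB for word in prompt.lower().split()]


def semantic_stage(vocal_semantic: List[int], text_codes: List[int]) -> List[int]:
    """Stage 1: accompaniment semantic tokens conditioned on vocals and text."""
    shift = sum(text_codes) % VOCAB  # toy conditioning, not a real model
    return [(tok + shift) % VOCAB for tok in vocal_semantic]


def coarse_acoustic_stage(accomp_semantic: List[int], vocal_coarse: List[int]) -> List[int]:
    """Stage 2: coarse acoustic tokens from stage-1 output and vocal coarse tokens."""
    return [(s + c) % VOCAB for s, c in zip(accomp_semantic, vocal_coarse)]


def generate_accompaniment_tokens(vocals: List[float], prompt: str) -> List[int]:
    """Chain both stages; a real system would decode the result to audio."""
    vocal_semantic = toy_vocal_tokens(vocals)
    vocal_coarse = toy_vocal_tokens(vocals)  # same toy tokenizer reused for brevity
    accomp_semantic = semantic_stage(vocal_semantic, toy_text_codes(prompt))
    return coarse_acoustic_stage(accomp_semantic, vocal_coarse)
```

The point of the sketch is the dependency structure: the text prompt enters only in the first stage, while the vocals condition both stages, so changing the prompt changes the semantic tokens and, through them, the acoustic tokens.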

Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

SingSong: Generating musical accompaniments from singing

A Deep Learning Based Analysis-Synthesis Framework For Unison Singing

FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation

Improving Controllability and Editability for Pretrained Text-to-Music Generation Models

A Melody-Unsupervision Model for Singing Voice Synthesis

Singing-Tacotron

SongCreator: Lyrics-based Universal Song Generation

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias

SingNet: A Real-time Singing Voice Beat and Downbeat Tracking System

Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models

The ACCompanion: Combining Reactivity, Robustness, and Musical Expressivity in an Automatic Piano Accompanist

Synchronising speech segments with musical beats in Mandarin and English singing

A Framework for Automated Pop-song Melody Generation with Piano Accompaniment Arrangement

SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint

AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings