Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations

Quoc-Huy Trinh, Minh-Van Nguyen, Trong-Hieu Nguyen Mau, Khoa Tran, Thanh Do
2024-11-04
Abstract: Singing is one of the most cherished forms of human entertainment. However, creating a beautiful song requires an accompaniment that complements the vocals and aligns well with the song's instruments and genre. With advancements in deep learning, previous research has focused on generating suitable accompaniments, but the results often lack precise alignment with the desired instrumentation and genre. To address this, we propose a straightforward method that enables control over the accompaniment through text prompts, allowing the generation of music that complements the vocals and aligns with the song's instrumentation and genre requirements. Through extensive experiments, we successfully generate 10-second accompaniments using vocal input and text control.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper addresses is generating musical accompaniment that closely matches user requirements. Although existing music-generation models can produce high-quality accompaniment, they offer little precise control over instrument selection, rhythm, and style, which makes it difficult to meet users' individual needs. To address this, the authors propose a system named Llambada, which controls accompaniment generation through text prompts so that users can specify the accompaniment characteristics they want.

### Main Problems

1. **Lack of Fine-grained Control over Generated Accompaniment**: Accompaniment generated by existing methods rarely meets users' expectations for instruments, rhythm, and style.
2. **Insufficient Datasets**: Datasets used to train music-generation models usually do not contain detailed text descriptions, which limits what the models can learn.

### Solutions

- **Llambada System**: Through text prompts, users specify details such as the required instruments, rhythm, and style, so the generated accompaniment better meets their needs.
- **Pseudo-Caption Dataset Generation Pipeline**: To overcome the shortage of captioned data, the authors propose a pseudo-caption generation method that automatically produces music datasets with detailed descriptions to support model training.

### Technical Details

- **Two-stage Generation Model**:
  - **Semantic Generation Stage**: Generates semantic tokens from the vocal input and the text prompt. These tokens represent the overall structure and rhythm of the music.
  - **Coarse-grained Acoustic Generation Stage**: Generates the final audio waveform from the semantic tokens and the coarse-grained tokens of the vocal input.
- **Key Components**:
  - **MERT Model**: Extracts the semantic information of the audio.
  - **CLAP Model**: Converts text prompts into discrete codes that align with audio features.
  - **Encodec Model**: Encodes and decodes audio signals to produce high-quality audio output.

### Experimental Results

In the reported experiments, Llambada performs well on multiple evaluation metrics and outperforms existing methods, most notably in audio quality and in consistency with the text prompt. The experiments include both in-domain and out-of-domain tests, which support the model's generalization ability and robustness.

### Conclusion

By introducing text-prompt control, Llambada addresses the lack of fine-grained control over accompaniment generation in existing music-generation models and provides new ideas and tools for future research. The authors also point out directions for further improvement, such as better alignment between text prompts and music semantic features, and lower computational requirements.
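
The data flow of the two-stage model described under Technical Details can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name is hypothetical, and the MERT, CLAP, and Encodec models are replaced by deterministic hash-based stubs so that the example runs without any model weights. It only shows which inputs condition which stage.

```python
import hashlib
from typing import List


def _stub_tokens(seed: str, n: int) -> List[int]:
    """Deterministic placeholder tokens in 0..255 (stands in for a real model)."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    return [digest[i % len(digest)] for i in range(n)]


def text_to_codes(prompt: str, n: int = 4) -> List[int]:
    # Assumption: stands in for CLAP turning the text prompt into discrete codes.
    return _stub_tokens("text:" + prompt, n)


def vocal_semantic_tokens(vocal: List[float], n: int = 8) -> List[int]:
    # Assumption: stands in for MERT extracting semantic information.
    return _stub_tokens("semantic:" + repr(vocal), n)


def vocal_coarse_tokens(vocal: List[float], n: int = 8) -> List[int]:
    # Assumption: stands in for the Encodec encoder's coarse-grained tokens.
    return _stub_tokens("coarse:" + repr(vocal), n)


def semantic_stage(vocal_sem: List[int], text_codes: List[int], n: int = 8) -> List[int]:
    # Stage 1: accompaniment semantic tokens, conditioned on vocals and text.
    return _stub_tokens(f"stage1:{vocal_sem}:{text_codes}", n)


def coarse_acoustic_stage(acc_sem: List[int], vocal_coarse: List[int], n: int = 8) -> List[int]:
    # Stage 2: coarse acoustic tokens, conditioned on the stage-1 semantic
    # tokens and the coarse-grained tokens of the vocal input.
    return _stub_tokens(f"stage2:{acc_sem}:{vocal_coarse}", n)


def decode_to_waveform(tokens: List[int], samples_per_token: int = 4) -> List[float]:
    # Assumption: stands in for the Encodec decoder; maps tokens to [-1, 1].
    return [t / 127.5 - 1.0 for t in tokens for _ in range(samples_per_token)]


def generate_accompaniment(vocal: List[float], prompt: str) -> List[float]:
    text_codes = text_to_codes(prompt)
    acc_sem = semantic_stage(vocal_semantic_tokens(vocal), text_codes)
    acc_coarse = coarse_acoustic_stage(acc_sem, vocal_coarse_tokens(vocal))
    return decode_to_waveform(acc_coarse)


if __name__ == "__main__":
    toy_vocal = [0.0, 0.1, -0.2, 0.3]  # toy samples, not real audio
    wave = generate_accompaniment(toy_vocal, "acoustic guitar, pop")
    print(len(wave), "samples generated")
```

In this sketch the text prompt enters only at the semantic stage, while the vocal input conditions both stages, which mirrors the description above. A real system would replace each stub with the corresponding pretrained model and a trained token generator.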