Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks

Felipe Marra, Lucas N. Ferreira
2024-11-06
Abstract: This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality, while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
Sound, Artificial Intelligence, Multimedia, Neural and Evolutionary Computing, Audio and Speech Processing
What problem does this paper attempt to address?
This paper evaluates **how well text-to-music generation models perform when generating long-form music (longer than 30 seconds), particularly background music for tabletop role-playing games (TRPGs)**. Specifically, the authors focus on:

1. **Challenges in generating long-form music**: Existing text-to-music generation models usually produce short clips (such as 30 seconds). When longer music is required, the challenge is to keep it high in quality, coherent, and consistent with the story scene. In TRPGs in particular, the music description must be updated continuously as the plot of the game changes.
2. **The role of adaptive prompts**: How can dynamically changing prompts (that is, music descriptions generated in real time from the game plot) control the generation process so that the music matches the current scene and transitions smoothly between scenes?

To address these problems, the authors propose a system named **Babel Bardo**, which combines a large language model (LLM) with a text-to-music generation model: it converts transcriptions of the players' speech into music descriptions and generates the corresponding background music. The paper compares four versions of Babel Bardo in two TRPG campaigns and evaluates them on the following aspects:

- **Audio quality**: measured with the Fréchet Audio Distance (FAD) metric.
- **Consistency with the story**: measured with the Kullback-Leibler Divergence (KLD) between the generated music and the original background music.
- **Smoothness of transition**: also measured with KLD, here used to assess how smoothly one music clip transitions to the next.
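The two metrics named above can be illustrated with a minimal NumPy sketch. It rests on assumptions this summary does not confirm: that the Fréchet distance is computed between Gaussians fitted to audio embeddings (the embedding model is omitted here and the arrays are plain placeholders), and that KLD is taken between discrete probability vectors such as classifier label distributions. The function names are illustrative, not the paper's code.

```python
import numpy as np


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input has shape (n_samples, dim). Assumption: embeddings come
    from some pretrained audio model, which is not part of this sketch.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))
    # tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of the product; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()  # renormalise after clipping
    return float(np.sum(p * np.log(p / q)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 4))   # placeholder "reference" embeddings
    generated = reference + 3.0             # same spread, shifted mean
    print("Frechet distance:", frechet_distance(reference, generated))
    print("KLD:", kl_divergence(np.array([0.5, 0.5]), np.array([0.9, 0.1])))
```

For both metrics lower values are better: a Fréchet distance near zero means the generated embeddings are distributed like the reference set, and a KLD near zero means the two label distributions nearly coincide.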
The experimental results show that detailed music descriptions improve audio quality, and that keeping consecutive descriptions consistent with one another improves story alignment and yields smoother music transitions. In addition, emotional signals (such as happiness, calmness, excitement, and suspense) play an important role in generating music that fits the TRPG plot.
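The idea of keeping consecutive descriptions consistent can be sketched as a loop that hands the previous music description back to the describer when it writes the next one. This is a hypothetical illustration only: the summary does not say how the three LLM-based versions build their prompts, so `adaptive_prompts`, the `Describer` signature, and the keyword-based `toy_describer` (a stand-in for an LLM call) are all assumptions. The text-to-music step that would consume each prompt is left out.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical signature: (transcript_segment, previous_description) -> description
Describer = Callable[[str, Optional[str]], str]


def adaptive_prompts(
    segments: Iterable[str],
    describe: Describer,
    carry_context: bool = True,
) -> List[str]:
    """Turn transcript segments into one music description per segment.

    With carry_context=True the previous description is passed along, so
    the describer can keep consecutive prompts related to each other.
    """
    prompts: List[str] = []
    previous: Optional[str] = None
    for segment in segments:
        description = describe(segment, previous if carry_context else None)
        prompts.append(description)
        previous = description
    return prompts


def toy_describer(segment: str, previous: Optional[str]) -> str:
    """Keyword stand-in for an LLM call; not the paper's prompt or model."""
    text = segment.lower()
    mood = "suspenseful" if "dark" in text else "calm"
    if previous is not None:
        # Reuse the earlier instrumentation and change only the mood.
        instrumentation = previous.split(" ", 1)[1]
    else:
        instrumentation = "lute folk music" if "tavern" in text else "orchestral music"
    return f"{mood} {instrumentation}"


if __name__ == "__main__":
    transcript = [
        "We enter the tavern and order drinks.",
        "A dark figure watches from the corner.",
    ]
    print(adaptive_prompts(transcript, toy_describer, carry_context=True))
    print(adaptive_prompts(transcript, toy_describer, carry_context=False))
```

With context carried over, the second prompt changes the mood but keeps the instrumentation of the first. Without it, each description is written from scratch and may jump to an unrelated style, which is the kind of abrupt change that could hurt transition smoothness.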