WavJourney: Compositional Audio Creation with Large Language Models

Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
2023-11-26
Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages Large Language Models (LLMs) to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program, where each line of the program calls a task-specific audio generation model or computational operation function. The computer program is then executed to obtain a compositional and interpretable solution for audio creation. Experimental results suggest that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions, achieving state-of-the-art results on text-to-audio generation benchmarks. Additionally, we introduce a new multi-genre story benchmark. Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text. We further demonstrate that WavJourney can facilitate human-machine co-creation in multi-round dialogues. To foster future research, the code and synthesized audio are available at: https://audio-agi.github.io/WavJourney_demopage/
Sound, Artificial Intelligence, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of **automatically creating compositional audio content that combines multiple audio elements (such as speech, music, and sound effects)**. Existing audio generation models are usually confined to domain-specific conditions, such as speech transcriptions and audio captions, whereas real-world audio creation requires harmonious audio in which these varied elements are combined under controllable conditions. Existing audio generation systems struggle to meet this requirement. The paper therefore proposes WavJourney, a framework that uses large language models (LLMs) to connect different audio models and create storytelling audio with diverse audio elements from text descriptions. Given a text instruction, WavJourney prompts an LLM to generate a structured audio script, converts that script into a computer program in which each line calls a task-specific audio generation model or a computational operation, and executes the program to obtain a compositional and interpretable audio result.
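The script-to-program-to-audio flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the script fields (`type`, `layout`, `desc`, `seconds`, `text`), the function names, and the stub generators that return constant-valued clips instead of calling real models. It is not the paper's actual script format or code.

```python
SAMPLE_RATE = 16000  # assumed sample rate for this sketch

# Hypothetical audio script, standing in for what the LLM would generate.
AUDIO_SCRIPT = [
    {"type": "music", "layout": "background", "desc": "calm piano", "seconds": 4},
    {"type": "speech", "layout": "foreground", "text": "Once upon a time."},
    {"type": "sound_effect", "layout": "foreground", "desc": "door creaks", "seconds": 1},
]

# Stub generators: each returns a constant-valued clip instead of calling a real model.
def text_to_speech(text):
    # Assume 0.3 s per word (4800 samples at 16 kHz).
    return [0.5] * (len(text.split()) * SAMPLE_RATE * 3 // 10)

def text_to_music(desc, seconds):
    return [0.125] * (seconds * SAMPLE_RATE)

def text_to_sound(desc, seconds):
    return [0.25] * (seconds * SAMPLE_RATE)

GENERATORS = {
    "speech": text_to_speech,
    "music": text_to_music,
    "sound_effect": text_to_sound,
}

def compile_script(script):
    """Turn each script entry into one program step: (function, kwargs, layout)."""
    program = []
    for entry in script:
        kwargs = {k: v for k, v in entry.items() if k not in ("type", "layout")}
        program.append((GENERATORS[entry["type"]], kwargs, entry["layout"]))
    return program

def mix(a, b):
    """Sum two clips sample by sample, padding the shorter one with silence."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

def run_program(program):
    """Run each step, concatenate foreground clips, then mix in background clips."""
    foreground, backgrounds = [], []
    for fn, kwargs, layout in program:
        clip = fn(**kwargs)
        if layout == "foreground":
            foreground.extend(clip)
        else:
            backgrounds.append(clip)
    result = foreground
    for clip in backgrounds:
        result = mix(result, clip)
    return result

audio = run_program(compile_script(AUDIO_SCRIPT))
print(len(audio) / SAMPLE_RATE, "seconds")
```

Because each program step maps to one script entry, the intermediate program can be inspected or edited before execution, which is one way to read the "interpretable" property the abstract claims.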