WavJourney: Compositional Audio Creation with Large Language Models

Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

2023-11-26

Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages Large Language Models (LLMs) to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program, where each line of the program calls a task-specific audio generation model or computational operation function. The computer program is then executed to obtain a compositional and interpretable solution for audio creation. Experimental results suggest that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions, achieving state-of-the-art results on text-to-audio generation benchmarks. Additionally, we introduce a new multi-genre story benchmark. Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text. We further demonstrate that WavJourney can facilitate human-machine co-creation in multi-round dialogues. To foster future research, the code and synthesized audio are available at: https://audio-agi.github.io/WavJourney_demopage/

Sound, Artificial Intelligence, Multimedia, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses **the automatic creation of composite audio content that combines multiple audio elements (such as speech, music, and sound effects)**. Existing audio generation models are usually confined to domain-specific conditions, such as speech transcriptions or audio captions, whereas real-world audio creation requires harmonious audio in which diverse elements are combined under controllable conditions. Existing audio generation systems struggle to meet this requirement. The paper therefore proposes WavJourney, a framework that uses large language models (LLMs) to connect different audio models and create storytelling audio content with diverse audio elements from text descriptions. Given a text instruction, WavJourney prompts an LLM to generate a structured audio script, converts that script into a computer program in which each line calls a task-specific audio generation model or a computational operation, and executes the program to obtain a compositional and interpretable result.
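The script-to-program-to-audio flow described above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration rather than the paper's actual implementation: the script fields (`type`, `layout`, `text`, `desc`, `seconds`), the function names `tts`, `ttm`, `tta`, `concat`, and `mix`, and the placeholder generators, which return silence instead of calling real speech, music, or sound-effect models.

```python
import json

# Hypothetical audio script, as an LLM might return it (field names are illustrative).
SCRIPT_JSON = """
[
  {"type": "music", "layout": "background", "desc": "calm piano", "seconds": 4},
  {"type": "speech", "layout": "foreground", "text": "Welcome to the show.", "seconds": 2},
  {"type": "sound_effect", "layout": "foreground", "desc": "door closing", "seconds": 1}
]
"""

SAMPLE_RATE = 8000  # samples per second, chosen arbitrarily for the sketch


def _silence(seconds):
    return [0.0] * int(seconds * SAMPLE_RATE)


# Placeholder generators: a real system would call text-to-speech,
# text-to-music, and text-to-audio models here. These only return silence.
def tts(text, seconds):
    return _silence(seconds)


def ttm(desc, seconds):
    return _silence(seconds)


def tta(desc, seconds):
    return _silence(seconds)


def concat(clips):
    """Join clips one after another in time."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(a, b):
    """Overlay two tracks sample by sample, padding the shorter with silence."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def compile_script(script):
    """Turn the script into program lines; each line calls exactly one function."""
    calls = {"speech": "tts", "music": "ttm", "sound_effect": "tta"}
    lines, foreground, background = [], [], []
    for i, item in enumerate(script):
        prompt = item.get("text") or item.get("desc")
        lines.append(f"clip{i} = {calls[item['type']]}({prompt!r}, {item['seconds']})")
        track = foreground if item["layout"] == "foreground" else background
        track.append(f"clip{i}")
    lines.append(f"foreground = concat([{', '.join(foreground)}])")
    lines.append(f"background = concat([{', '.join(background)}])")
    lines.append("result = mix(foreground, background)")
    return lines


def run_program(lines):
    """Execute the generated program. exec is acceptable here only because
    the program text is produced locally by compile_script."""
    env = {"tts": tts, "ttm": ttm, "tta": tta, "concat": concat, "mix": mix}
    exec("\n".join(lines), env)
    return env["result"]


program = compile_script(json.loads(SCRIPT_JSON))
audio = run_program(program)
print("\n".join(program))
print(len(audio) / SAMPLE_RATE, "seconds")
```

Because the intermediate program is plain text, each generated line can be read, edited, or regenerated on its own, which is one plausible reading of why the paper calls the resulting solution interpretable.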

WavCraft: Audio Editing and Generation with Large Language Models

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model

AudioLM: a Language Modeling Approach to Audio Generation

Audiobox: Unified Audio Generation with Natural Language Prompts

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

WavLLM: Towards Robust and Adaptive Speech Large Language Model

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

ByteComposer: a Human-like Melody Composition Method based on Language Model Agent

A Framework for Synthetic Audio Conversations Generation using Large Language Models

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation