AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe
2023-04-26
Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at https://github.com/AIGC-Audio/AudioGPT.
Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
### What problem does this paper attempt to solve?

This paper proposes a multimodal AI system named **AudioGPT**, which aims to address the following issues:

1. **Handling complex audio information**: Existing large language models (LLMs) such as ChatGPT perform well on text but have limited ability to process speech, music, sound, and talking-head content.
2. **Support for spoken dialogue**: Current LLMs cannot hold spoken conversations the way Siri or Alexa can.
3. **Multimodal data processing**: Foundation models for audio processing exist, but integrating them with LLMs to support multimodal tasks remains challenging.

### Main Contributions

1. **Integration of audio foundation models**: Combining ChatGPT with a range of audio foundation models lets AudioGPT handle complex audio tasks.
2. **Speech conversion interface**: Speech recognition (ASR) and text-to-speech (TTS) interfaces convert between speech and text, which supports spoken dialogue.
3. **Multimodal evaluation principles**: The paper outlines principles for evaluating multimodal LLMs in terms of consistency, capability, and robustness, and tests AudioGPT against them experimentally.

### System Design

The design of AudioGPT is divided into four main stages:

1. **Modality conversion**: Converting different input modalities (speech, text) into a consistent format.
2. **Task analysis**: Using a dialogue engine and prompt manager to parse user intent and generate structured task parameters.
3. **Model allocation**: Selecting appropriate audio foundation models based on the task requirements.
4. **Response generation**: Generating the final response and returning it to the user.
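
The four stages can be sketched as a small pipeline. The Python below is only an illustration under stated assumptions and is not AudioGPT's actual code: every name (`Task`, `to_text`, `analyze_task`, `MODELS`, `run_pipeline`) is hypothetical, keyword rules stand in for the ChatGPT-based dialogue engine and prompt manager, and string-returning stubs stand in for the audio foundation models.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Task:
    """Structured task parameters produced by the task-analysis stage."""
    name: str
    argument: str


def to_text(user_input: str, is_speech: bool = False) -> str:
    """Stage 1, modality conversion: bring speech and text to one format (text)."""
    if is_speech:
        # Assumption: a real system would call an ASR model here. In this
        # sketch the "speech" input is already a transcript that we normalize.
        return user_input.strip().lower()
    return user_input.strip()


def analyze_task(query: str) -> Task:
    """Stage 2, task analysis: keyword rules stand in for the LLM."""
    lowered = query.lower()
    if "transcribe" in lowered:
        return Task("speech_recognition", query)
    if "music" in lowered:
        return Task("music_generation", query)
    return Task("text_to_speech", query)


# Stage 3, model allocation: a registry mapping task names to stub models.
MODELS: Dict[str, Callable[[str], str]] = {
    "speech_recognition": lambda arg: f"[transcript for: {arg}]",
    "music_generation": lambda arg: f"[music for: {arg}]",
    "text_to_speech": lambda arg: f"[audio for: {arg}]",
}


def allocate_model(task: Task) -> Callable[[str], str]:
    """Pick the stub model registered for the task."""
    return MODELS[task.name]


def respond(task: Task, output: str) -> str:
    """Stage 4, response generation: wrap the model output for the user."""
    return f"{task.name}: {output}"


def run_pipeline(user_input: str, is_speech: bool = False) -> str:
    """Run the four stages in order on a single user turn."""
    query = to_text(user_input, is_speech)
    task = analyze_task(query)
    model = allocate_model(task)
    return respond(task, model(task.argument))


if __name__ == "__main__":
    print(run_pipeline("Generate some music for a rainy day"))
    print(run_pipeline("TRANSCRIBE this clip", is_speech=True))
```

Keeping model allocation as a registry lookup means a new audio model can be added by registering one more entry, without touching the other stages.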

### Experimental Results

Experiments show that AudioGPT can carry out understanding and generation tasks involving speech, music, sound, and talking heads across multi-turn dialogues, helping users create rich and diverse audio content with little effort.