Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at https://github.com/AIGC-Audio/AudioGPT.

### What problem does this paper attempt to solve?

This paper proposes a multimodal artificial intelligence system named **AudioGPT**, aiming to address the following issues:

1. **Ability to handle complex audio information**: Existing large language models (LLMs) like ChatGPT perform well in text processing but have limited capabilities in handling speech, music, sounds, and talking head avatars.
2. **Support for spoken dialogue**: Current LLMs cannot engage in spoken dialogue like Siri or Alexa.
3. **Multimodal data processing**: Although there are some foundational models for audio processing, integrating these models with LLMs to support multimodal tasks remains challenging.

### Main Contributions

1. **Integration of audio foundational models**: Combining ChatGPT with various audio foundational models enables AudioGPT to handle complex audio tasks.
2. **Speech conversion interface**: Implementing speech recognition (ASR) and text-to-speech (TTS) interfaces to convert between speech and text, supporting spoken dialogue.
3. **Multimodal evaluation principles**: Proposing a set of principles for evaluating multimodal LLMs, including consistency, capability, and robustness, and validating AudioGPT's performance in these aspects through experiments.

### System Design

The design of AudioGPT is divided into four main stages:

1. **Modality conversion**: Converting different input modalities (speech, text) into a consistent format.
2. **Task analysis**: Using a dialogue engine and prompt manager to parse user intent and generate structured task parameters.
3. **Model allocation**: Selecting appropriate audio foundational models based on task requirements.
4. **Response generation**: Generating the final response to return to the user.

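The four stages above can be pictured as a small pipeline. The following is a minimal sketch under stated assumptions, not AudioGPT's actual code: every name in it (`Task`, `MODEL_REGISTRY`, `fake_asr`, `handle`, and the stage functions) is hypothetical, the speech recognizer and the audio models are string-returning stubs, and simple keyword rules stand in for the LLM-based dialogue engine and prompt manager.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union


@dataclass
class Task:
    """Structured task parameters produced by task analysis (hypothetical shape)."""
    name: str
    args: dict


# Hypothetical stand-ins for audio foundation models, keyed by task name.
MODEL_REGISTRY: Dict[str, Callable[[dict], str]] = {
    "text-to-speech": lambda args: f"<speech audio for: {args['text']}>",
    "sound-generation": lambda args: f"<sound clip of: {args['text']}>",
}


def fake_asr(audio: bytes) -> str:
    # Stub for a speech recognizer: pretends the bytes already hold a transcript.
    return audio.decode("utf-8")


def modality_conversion(user_input: Union[str, bytes]) -> str:
    """Stage 1: bring speech (bytes) and text (str) into one format, plain text."""
    if isinstance(user_input, bytes):
        return fake_asr(user_input)
    return user_input


def task_analysis(query: str) -> Task:
    """Stage 2: parse intent into a Task. Keyword rules stand in for the LLM here."""
    text = query.strip()
    lowered = text.lower()
    sound_prefix = "generate the sound of "
    if lowered.startswith("say "):
        return Task("text-to-speech", {"text": text[len("say "):]})
    if lowered.startswith(sound_prefix):
        return Task("sound-generation", {"text": text[len(sound_prefix):]})
    return Task("chat", {"text": text})


def model_allocation(task: Task) -> Optional[Callable[[dict], str]]:
    """Stage 3: pick the model registered for the task, or None for plain chat."""
    return MODEL_REGISTRY.get(task.name)


def response_generation(task: Task, model: Optional[Callable[[dict], str]]) -> str:
    """Stage 4: run the selected model, or fall back to a text-only reply."""
    if model is None:
        return f"(text reply to: {task.args['text']})"
    return model(task.args)


def handle(user_input: Union[str, bytes]) -> str:
    """Run one dialogue turn through all four stages in order."""
    query = modality_conversion(user_input)
    task = task_analysis(query)
    model = model_allocation(task)
    return response_generation(task, model)
```

In this sketch, `handle("Say hello there")` is routed to the stub text-to-speech entry, while a request that matches no rule falls through to a plain text reply. A real system would replace the keyword rules with an LLM call and the registry entries with actual model invocations.
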
### Experimental Results

Experimental results show that AudioGPT performs well on understanding and generation tasks involving speech, music, sounds, and talking head avatars in multi-turn dialogues, helping users easily create rich and diverse audio content.

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

PodGPT: An audio-augmented large language model for research and education

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

PandaGPT: One Model to Instruction-Follow Them All

UniAudio: Towards Universal Audio Generation with Large Language Models

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

AudioPaLM: A Large Language Model That Can Speak and Listen

Audiobox: Unified Audio Generation with Natural Language Prompts

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

AutoML-GPT: Automatic Machine Learning with GPT

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

LLM-AD: Large Language Model based Audio Description System

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Roadmap towards Superhuman Speech Understanding using Large Language Models