LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Zhihao Du, Jiaming Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

2024-07-03

Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement compared with models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modality. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves performance comparable or superior to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding.

Sound, Artificial Intelligence, Machine Learning, Multimedia, Audio and Speech Processing

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Improving Unified Audio-and-Text Large Language Models (Audio-and-Text LLMs)**: Existing unified audio-and-text LLMs represent both input and output audio as discrete tokens. Quantizing continuous speech signals into discrete tokens loses information, which degrades performance on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement. The paper proposes a new unified audio-and-text LLM, LauraGPT, which encodes input audio into continuous representations and generates output audio from discrete codec codes.
2. **Simplifying the Audio Generation Process**: Current methods rely on a multi-step audio synthesis scheme to generate high-quality audio, which is complex and struggles with the multimodal distribution of codec tokens. LauraGPT introduces a one-step codec vocoder that simplifies audio generation and overcomes the prediction challenge posed by this multimodal distribution.
3. **Supporting Various Audio-Related Tasks**: LauraGPT handles multiple tasks, including Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Text-to-Speech Synthesis (TTS), and Speech Enhancement (SE), and is reported to achieve performance comparable or superior to strong baselines on them. It also provides a scalable framework that supports more complex task combinations, such as Speech-to-Speech Translation (S2ST).

In summary, the paper proposes LauraGPT, a unified audio-and-text LLM that addresses the performance degradation of existing models on speech recognition and generation tasks and simplifies the audio generation process, thereby achieving better results across a range of audio-related tasks.
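The split between continuous input features and discrete output tokens can be illustrated with a small toy example. This is a minimal sketch, not the paper's implementation: the "encoder" here is just per-frame mean and RMS energy, the "codec" is a nearest-neighbour lookup in a hand-made codebook, and the names `continuous_features`, `quantize`, and `build_sequence` are hypothetical. It only shows why quantizing input features discards information, and how a single training example might pair continuous inputs with discrete targets.

```python
import math
import random


def frame_signal(samples, frame_len):
    """Split a 1-D signal into non-overlapping frames (drops the ragged tail)."""
    n = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n)]


def continuous_features(frames):
    """Toy stand-in for an audio encoder: per-frame (mean, RMS energy) floats."""
    feats = []
    for f in frames:
        mean = sum(f) / len(f)
        rms = math.sqrt(sum(x * x for x in f) / len(f))
        feats.append((mean, rms))
    return feats


def quantize(feats, codebook):
    """Toy stand-in for a codec quantizer: nearest codebook index per frame."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(f, codebook[k]))
            for f in feats]


def build_sequence(input_feats, task_id, target_tokens):
    """Hypothetical training example: continuous input, task tag, discrete target."""
    return {"input": input_feats, "task": task_id, "target": target_tokens}


# Synthetic "audio": a noisy sine wave (assumption, for illustration only).
rng = random.Random(0)
signal = [math.sin(0.05 * t) + 0.1 * rng.uniform(-1, 1) for t in range(800)]

frames = frame_signal(signal, frame_len=80)
feats = continuous_features(frames)

# Hand-made 25-entry codebook over (mean, rms); a real codec learns its codebooks.
codebook = [(m / 2, r / 4) for m in range(-2, 3) for r in range(0, 5)]
tokens = quantize(feats, codebook)

# Mean squared error introduced by replacing each feature with its codebook entry.
dequantized = [codebook[k] for k in tokens]
quant_error = sum(
    (a - b) ** 2 for f, q in zip(feats, dequantized) for a, b in zip(f, q)
) / len(feats)

# Input side keeps the continuous features; output side uses discrete tokens.
example = build_sequence(feats, "SE", tokens)
print(len(example["input"]), len(example["target"]), quant_error)
```

The nonzero `quant_error` is the information a model would lose if it only saw the tokens on the input side, while discrete tokens remain convenient as prediction targets for a GPT-style decoder.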

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

PodGPT: An audio-augmented large language model for research and education

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

AudioLM: a Language Modeling Approach to Audio Generation

AudioPaLM: A Large Language Model That Can Speak and Listen

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

UniAudio: Towards Universal Audio Generation with Large Language Models

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

LLM-AD: Large Language Model based Audio Description System

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner