Abstract: The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second, advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, and ER, and also apply it to specialized datasets such as the Gaokao English listening comprehension set for SQA and a speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks at the same model size, exhibiting robust generalization capabilities in executing complex tasks using the CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training.
The codes, models, audio, and Gaokao evaluation set can be accessed at http://aka.ms/wavllm.
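
The dual-encoder idea in the abstract can be illustrated with a small sketch. The abstract says only that a Whisper encoder handles semantic content and a WavLM encoder handles speaker characteristics; how the two outputs are combined is not stated here. The sketch below assumes, purely for illustration, that both feature streams are aligned to a common frame rate and concatenated along the feature axis. The function name `fuse_dual_encoder` and the toy feature values are hypothetical and stand in for real encoder outputs.

```python
from typing import List

Frame = List[float]  # one feature vector per time step


def fuse_dual_encoder(semantic: List[Frame], acoustic: List[Frame]) -> List[Frame]:
    """Concatenate per-frame features from two encoders along the feature axis.

    Assumption (not from the paper): both streams already share a frame rate;
    if their lengths still differ, the longer stream is truncated.
    """
    n = min(len(semantic), len(acoustic))
    return [semantic[t] + acoustic[t] for t in range(n)]


# Toy stand-ins for encoder outputs; the numbers are made up.
semantic_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames, dim 2
acoustic_feats = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]    # 2 frames, dim 3

fused = fuse_dual_encoder(semantic_feats, acoustic_feats)
print(len(fused), len(fused[0]))  # prints: 2 5
```

Concatenation keeps the two information types in separate feature slots, so a downstream adapter could in principle weight them differently per task; other fusion schemes would serve equally well as an illustration.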

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve several key challenges faced by large language models (LLMs) when dealing with speech tasks:

1. **Instruction sensitivity**: Existing large language models for speech experience a significant performance decline when faced with unseen or complex instructions. This is because these models are very sensitive to the instruction design for specific tasks.
2. **Lack of chain-of-thought ability**: Existing speech models lack the chain-of-thought (CoT) ability required to handle complex tasks, which limits their performance in executing multi-step tasks.

To solve these problems, the paper proposes **WavLLM**, a robust and adaptable large language model for speech. The main contributions of WavLLM include:

1. **Curriculum learning method**: Improve the generalization ability and robustness of the model through a training strategy that gradually transitions from simple tasks to complex tasks.
2. **Dual-encoder architecture**: Utilize the Whisper encoder to capture semantic information and the WavLM encoder to capture acoustic features, thereby enriching the speech representation.
3. **Prompt-aware LoRA weight adapter**: Introduce a new prompt-aware LoRA weight adapter that can dynamically adjust LoRA weights according to different prompts, further enhancing the generalization ability and robustness of the model.

### Specific problem-solving methods

1. **Curriculum learning method**:
   - **Mixed single-task training phase**: Use multiple single-task datasets (such as automatic speech recognition (ASR), speech-to-text translation (ST), speaker verification (SV), emotion recognition (ER), etc.) for preliminary training to optimize the modality adapter and LoRA components.
   - **Advanced multi-task training phase**: Combine multiple single-task instructions to construct a more complex multi-task dataset and further train the model so that it can handle complex multi-task instructions.
2. **Dual-encoder architecture**:
   - **Whisper encoder**: Used to extract the semantic information of speech.
   - **WavLM encoder**: Used to extract the acoustic features of speech, such as the unique features of the speaker.
3. **Prompt-aware LoRA weight adapter**:
   - **Dynamically adjust LoRA weights**: Dynamically adjust LoRA weights according to different prompts to improve the performance of the model when dealing with unseen or complex instructions.

### Experimental results

The experimental results show that WavLLM performs excellently in multiple speech tasks, especially in zero-shot speech question-answering (SQA) tasks and chain-of-thought (CoT) tasks. This is specifically manifested in the following aspects:

- **Single-task evaluation**: WavLLM has achieved state-of-the-art performance in tasks such as ASR, ST, SV, ER, and SQA.
- **Multi-task evaluation**: When dealing with independent instructions (II-Task) and chain-of-thought tasks (CoT), WavLLM significantly outperforms other open-source large language models for speech.

### Summary

Through the curriculum learning method, dual-encoder architecture, and prompt-aware LoRA weight adapter, WavLLM effectively solves the limitations of existing large language models for speech when dealing with complex tasks, demonstrating strong generalization ability and robustness.
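
The prompt-aware LoRA idea can be made concrete with a toy example. The summary states only that LoRA weights are adjusted dynamically according to the prompt; the exact mechanism is not given here. The sketch below assumes one plausible form: the low-rank update `B(Ax)` is scaled element-wise by a gate computed from a prompt embedding, `s = sigmoid(G · prompt_emb)`. The gate matrix `G`, the function `prompt_aware_lora`, and all numeric values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def prompt_aware_lora(x: Vector, W: Matrix, A: Matrix, B: Matrix,
                      G: Matrix, prompt_emb: Vector) -> Vector:
    """Compute y = W x + s * (B (A x)), where s = sigmoid(G prompt_emb).

    W is the frozen base weight, (A, B) the low-rank LoRA pair, and G a
    hypothetical gate that maps a prompt embedding to per-output scales.
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = [sigmoid(z) for z in matvec(G, prompt_emb)]
    return [b + s * l for b, s, l in zip(base, scale, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity for clarity)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[1.0], [2.0]]             # rank-1 up-projection
G = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical prompt gate
x = [1.0, 2.0]

y_a = prompt_aware_lora(x, W, A, B, G, prompt_emb=[0.0, 0.0])
y_b = prompt_aware_lora(x, W, A, B, G, prompt_emb=[10.0, -10.0])
print(y_a)  # [2.5, 5.0]: both gates sit at 0.5
print(y_b)  # first output gets nearly the full update, second almost none
```

The same input `x` yields different outputs under the two prompt embeddings, which is the behavior the summary attributes to the adapter: the base weights stay fixed while the contribution of the LoRA update depends on the instruction.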

WavLLM: Towards Robust and Adaptive Speech Large Language Model
