WavLLM: Towards Robust and Adaptive Speech Large Language Model

Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei
2024-09-21
Abstract: The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second, advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, and ER, and also apply it to specialized datasets such as the Gaokao English listening comprehension set for SQA and a speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks at the same model size, exhibiting robust generalization capabilities in executing complex tasks using a CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training.
The code, models, audio, and Gaokao evaluation set can be accessed at http://aka.ms/wavllm.
Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to solve several key challenges that large language models (LLMs) face when dealing with speech tasks:

1. **Instruction sensitivity**: Existing large language models for speech experience a significant performance decline when faced with unseen or complex instructions. This is because these models are very sensitive to the instruction design for specific tasks.
2. **Lack of chain-of-thought ability**: Existing speech models lack the chain-of-thought (CoT) ability required to handle complex tasks, which limits their performance in executing multi-step tasks.

To solve these problems, the paper proposes **WavLLM**, a robust and adaptive large language model for speech. The main contributions of WavLLM include:

1. **Curriculum learning method**: Improve the generalization ability and robustness of the model through a training strategy that gradually transitions from simple tasks to complex tasks.
2. **Dual-encoder architecture**: Utilize the Whisper encoder to capture semantic information and the WavLM encoder to capture acoustic features, thereby enriching the speech representation.
3. **Prompt-aware LoRA weight adapter**: Introduce a new prompt-aware LoRA weight adapter that can dynamically adjust LoRA weights according to different prompts, further enhancing the generalization ability and robustness of the model.

### Specific problem-solving methods

1. **Curriculum learning method**:
   - **Mixed single-task training phase**: Use multiple single-task datasets (such as automatic speech recognition (ASR), speech-to-text translation (ST), speaker verification (SV), emotion recognition (ER), etc.) for preliminary training to optimize the modality adapter and LoRA components.
   - **Advanced multi-task training phase**: Combine multiple single-task instructions to construct a more complex multi-task dataset and further train the model so that it can handle complex multi-task instructions.
2. **Dual-encoder architecture**:
   - **Whisper encoder**: Used to extract the semantic information of speech.
   - **WavLM encoder**: Used to extract the acoustic features of speech, such as the unique features of the speaker.
3. **Prompt-aware LoRA weight adapter**:
   - **Dynamically adjust LoRA weights**: Dynamically adjust LoRA weights according to different prompts to improve the performance of the model when dealing with unseen or complex instructions.

### Experimental results

The experimental results show that WavLLM performs excellently in multiple speech tasks, especially in zero-shot speech question-answering (SQA) tasks and chain-of-thought (CoT) tasks. This is specifically manifested in the following aspects:

- **Single-task evaluation**: WavLLM has achieved state-of-the-art performance in tasks such as ASR, ST, SV, ER, and SQA.
- **Multi-task evaluation**: When dealing with independent instructions (II-Task) and chain-of-thought tasks (CoT), WavLLM significantly outperforms other open-source large language models for speech.

### Summary

Through the curriculum learning method, dual-encoder architecture, and prompt-aware LoRA weight adapter, WavLLM effectively solves the limitations of existing large language models for speech when dealing with complex tasks, demonstrating strong generalization ability and robustness.
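
### Illustrative sketch: prompt-dependent LoRA scaling

The idea of adjusting LoRA weights according to the prompt can be illustrated with a small, dependency-free sketch. This is not the paper's implementation: the class `PromptAwareLoRALinear`, the gate matrix `G`, the sigmoid gating, and the fixed-size prompt embedding are all assumptions made for illustration. The sketch only shows one plausible way a prompt representation could rescale the low-rank update that LoRA adds to a frozen linear layer.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights in this toy example."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class PromptAwareLoRALinear:
    """Frozen linear layer plus a LoRA update rescaled by a prompt-dependent gate.

    Hypothetical illustration: y = W x + gate(prompt) * (B (A x)),
    where gate(prompt) is an element-wise sigmoid of a projection G of the
    prompt embedding. The names and the gating form are assumptions.
    """

    def __init__(self, d_in, d_out, rank, d_prompt, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)      # frozen base weight
        self.A = rand_matrix(rank, d_in, rng)       # LoRA down-projection
        # Standard LoRA initialises B to zero; random values are used here
        # only so that the effect of the gate is visible without training.
        self.B = rand_matrix(d_out, rank, rng)      # LoRA up-projection
        self.G = rand_matrix(d_out, d_prompt, rng)  # assumed gate projection

    def gate(self, prompt_emb):
        """Per-output-dimension scaling factors in (0, 1)."""
        return [1.0 / (1.0 + math.exp(-z)) for z in matvec(self.G, prompt_emb)]

    def forward(self, x, prompt_emb):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank LoRA path
        g = self.gate(prompt_emb)                   # prompt-dependent scaling
        return [b + gi * d for b, gi, d in zip(base, g, delta)]


if __name__ == "__main__":
    layer = PromptAwareLoRALinear(d_in=4, d_out=3, rank=2, d_prompt=5)
    x = [1.0, -0.5, 0.25, 2.0]
    # Two made-up prompt embeddings yield different scalings of the same update.
    print(layer.forward(x, [0.0] * 5))
    print(layer.forward(x, [5.0, -3.0, 2.0, 1.0, -4.0]))
```

In this sketch the same input `x` yields different outputs under different prompt embeddings because only the scaling of the LoRA path changes, while the base weight `W` stays fixed. How the prompt representation is actually obtained and applied in WavLLM is described in the paper itself.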