Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie
2024-05-06
Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research examines this paradigm in depth on a large open-source Chinese dataset. Specifically, we evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. This approach, together with the strategic integration of ASR components, enabled us to achieve SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper explores the use of large language models (LLMs) for automatic speech recognition (ASR), focusing on Chinese open-source datasets. Its core objective is to evaluate how different speech encoders, LLMs, and projection modules affect the performance of ASR systems built on the "speech foundation encoder + LLM decoder" paradigm. Specifically, the research team explored the following aspects:

1. **Speech Encoder**: Compared the supervised Whisper model and the self-supervised HuBERT model as speech encoders.
2. **Projection Module**: Compared the Qformer and Transformer types of projection module.
3. **Large Language Models (LLM)**: Analyzed the impact of different LLMs (such as Atom-7B and Baichuan2-7B-Chat) on overall system performance.
4. **Training Strategy**: Proposed a three-stage training method aimed at optimizing the alignment between the speech and text modalities.

Through experiments, the authors reached the following key conclusions:

- For speech encoders, Whisper is more robust than HuBERT but less adaptable;
- For projection modules, Transformer has better learning capability than Qformer;
- The performance of an LLM is positively correlated with its proficiency in the target language (Chinese in this study);
- The proposed three-stage training method effectively enhances the model's ability to learn the alignment between the speech and text modalities, and can achieve state-of-the-art (SOTA) results even with relatively small Chinese datasets.

Ultimately, under the optimal configuration (HuBERT as the encoder, Transformer as the projection module, and Baichuan2-7B-Chat as the LLM), the proposed model achieved SOTA results on the AISHELL-1, Test_Net, and Test_Meeting test sets.
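The component choices compared in the study can be written down as a small configuration grid. The following is a minimal sketch, not the authors' code: `AsrConfig` and `enumerate_configs` are hypothetical names, the component lists contain only the models named in this summary, and the summary does not say whether every combination was actually trained.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AsrConfig:
    """One encoder / projector / LLM combination (hypothetical container)."""
    encoder: str
    projector: str
    llm: str


# Only the components named in the summary; the paper may cover more.
ENCODERS = ("Whisper", "HuBERT")
PROJECTORS = ("Qformer", "Transformer")
LLMS = ("Atom-7B", "Baichuan2-7B-Chat")

# Configuration the summary reports as optimal.
BEST_CONFIG = AsrConfig("HuBERT", "Transformer", "Baichuan2-7B-Chat")


def enumerate_configs():
    """Return every encoder/projector/LLM combination as an AsrConfig."""
    return [AsrConfig(e, p, l) for e, p, l in product(ENCODERS, PROJECTORS, LLMS)]


if __name__ == "__main__":
    for cfg in enumerate_configs():
        marker = " (reported best)" if cfg == BEST_CONFIG else ""
        print(f"{cfg.encoder} + {cfg.projector} + {cfg.llm}{marker}")
```

Treating each combination as an immutable record makes it easy to key experiment logs or result tables by configuration.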
Additionally, the authors plan to release all scripts used for data preparation, training, inference, and scoring, as well as the pre-trained models and training logs, to promote reproducible research.
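The summary does not say how the scoring scripts compute results. Mandarin ASR is conventionally scored with character error rate (CER): the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below illustrates that convention under this assumption; it is not the authors' scoring script, and `edit_distance` and `cer` are hypothetical helper names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; spaces are ignored (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substituted character out of six in the reference.
    print(cer("今天天气很好", "今天天汽很好"))
```

Because insertions count as errors while the denominator is the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference.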