Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research examines this paradigm in depth on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.

What problem does this paper attempt to address?

The paper explores the use of large language models (LLMs) for automatic speech recognition (ASR), focusing on Chinese open-source datasets. Its core objective is to evaluate how different speech encoders, LLMs, and projection modules affect the performance of ASR systems built on the "speech foundation encoder + LLM decoder" paradigm. Specifically, the research team explored the following aspects:

1. **Speech Encoder**: Compared the supervised Whisper model and the self-supervised HuBERT model as speech encoders.
2. **Projection Module**: Compared Qformer and Transformer projection modules on the task.
3. **Large Language Model (LLM)**: Analyzed the impact of different LLMs (such as Atom-7B and Baichuan2-7B-Chat) on overall system performance.
4. **Training Strategy**: Proposed a three-stage training method aimed at optimizing the alignment between the speech and text modalities.

Through experiments, the authors reached the following key conclusions:

- For speech encoders, Whisper is more robust than HuBERT but less adaptable;
- For projection modules, Transformer has better learning capability than Qformer;
- LLM performance in this setting is positively correlated with the model's proficiency in the target language (Chinese in this study);
- The proposed three-stage training method effectively enhances the model's ability to learn the alignment between the speech and text modalities, and can achieve state-of-the-art (SOTA) results even with relatively small Chinese datasets.

Ultimately, under the optimal configuration (HuBERT as the encoder, Transformer as the projection module, and Baichuan2-7B-Chat as the LLM), the proposed model achieved SOTA results on the AISHELL-1, Test_Net, and Test_Meeting test sets.
Additionally, the authors plan to release all scripts used for data preparation, training, inference, and scoring, as well as the pre-trained models and training logs, to promote reproducible research.
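
To make the comparison space concrete, the following is a minimal Python sketch, not the authors' released code. It enumerates every combination of the components named above and records which modules might be trainable in each of the three stages. The `AsrConfig` container, the `llm_adapter` module name, and the per-stage trainable sets are all hypothetical: the summary states that training has three stages but not what each stage updates, and the paper may not have evaluated the full grid.

```python
from dataclasses import dataclass
from itertools import product

# Component choices named in the summary above.
SPEECH_ENCODERS = ("Whisper", "HuBERT")
PROJECTORS = ("Qformer", "Transformer")
LLMS = ("Atom-7B", "Baichuan2-7B-Chat")


@dataclass(frozen=True)
class AsrConfig:
    """One encoder + projector + LLM combination (hypothetical container)."""
    encoder: str
    projector: str
    llm: str


def enumerate_configs():
    """Return every combination of the listed components."""
    return [
        AsrConfig(enc, proj, llm)
        for enc, proj, llm in product(SPEECH_ENCODERS, PROJECTORS, LLMS)
    ]


# ASSUMPTION: the trainable modules per stage are illustrative only.
# The summary does not specify what each of the three stages updates.
MODULES = ("encoder", "projector", "llm_adapter")
STAGE_TRAINABLE = {
    1: {"projector"},
    2: {"encoder", "projector"},
    3: {"projector", "llm_adapter"},
}


def trainable_flags(stage):
    """Map each module name to True if it is updated in the given stage."""
    return {name: name in STAGE_TRAINABLE[stage] for name in MODULES}


# The configuration the summary reports as optimal.
BEST = AsrConfig("HuBERT", "Transformer", "Baichuan2-7B-Chat")

if __name__ == "__main__":
    for config in enumerate_configs():
        marker = "  <- reported best" if config == BEST else ""
        print(config, marker)
    for stage in sorted(STAGE_TRAINABLE):
        print(stage, trainable_flags(stage))
```

Keeping the stage schedule as plain data separates the question of which modules to freeze from the training loop itself, so the schedule can be replaced once the actual stage definitions are known.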

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Using Large Language Model for End-to-End Chinese ASR and NER

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

A Survey on Speech Large Language Models

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study

YuLan: An Open-source Large Language Model

A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction

Connecting Speech Encoder and Large Language Model for ASR

Advancing Multi-talker ASR Performance with Large Language Models

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text

Prompting Large Language Models with Speech Recognition Abilities