Abstract: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a linear projector as the only trainable component is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the LibriSpeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive paired data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLMs with cross-modality capacity and shed light on the LLM-based ASR community.

What problem does this paper attempt to address?

This paper asks how large language models (LLMs) and speech encoders can be combined into an efficient, high-performance automatic speech recognition (ASR) system. Specifically, the authors explore a simple composition of an off-the-shelf speech encoder, an LLM, and a linear projector. Only the linear projector is trained; the speech encoder and the LLM remain frozen. This setup challenges the necessity of current complex designs, and the paper aims to show that a simple architecture can match or exceed the performance of more elaborately designed LLM-based ASR systems.

The key contributions of the paper are:

1. **Simplified Design**: An "embarrassingly simple" method is proposed that aligns the speech and text modalities by training only a linear projector, without complex model designs or extensive parameter fine-tuning.

2. **Performance Improvement**: On the LibriSpeech benchmark, the proposed SLAM-ASR model achieves the best performance among current LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on large-scale paired data.

3. **Modal Alignment Ability**: The paper studies the emergence of capability in LLM-based ASR during modal alignment: early in training, the model's next-word prediction accuracy increases rapidly, then rises slowly, and finally "jumps" at a certain point, as if the ability were learned suddenly.

In conclusion, the paper shows experimentally that a concise design can achieve excellent performance on ASR tasks, pointing to a direction for future research, especially on using LLMs to extend cross-modal capabilities.
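The frozen-encoder, trainable-projector, frozen-LLM arrangement described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: `ToySpeechEncoder` and `ToyLLM` are hypothetical stand-ins for the real pretrained models, all dimensions are arbitrary, and plain concatenation of speech and text embeddings is an assumption about how the LLM input is assembled.

```python
import torch
import torch.nn as nn

# All sizes are illustrative assumptions, not values from the paper.
FEAT_DIM, ENC_DIM, LLM_DIM, VOCAB = 80, 32, 64, 100


class ToySpeechEncoder(nn.Module):
    """Hypothetical stand-in for an off-the-shelf speech encoder."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, ENC_DIM)

    def forward(self, feats):  # (batch, frames, FEAT_DIM) -> (batch, frames, ENC_DIM)
        return self.net(feats)


class ToyLLM(nn.Module):
    """Hypothetical stand-in for a pretrained LLM (position-wise, no attention)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, LLM_DIM)
        self.body = nn.Linear(LLM_DIM, LLM_DIM)
        self.head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, inputs_embeds):  # (batch, seq, LLM_DIM) -> (batch, seq, VOCAB)
        return self.head(torch.tanh(self.body(inputs_embeds)))


encoder, llm = ToySpeechEncoder(), ToyLLM()
projector = nn.Linear(ENC_DIM, LLM_DIM)  # the only trainable component

# Freeze the encoder and the LLM; only the projector keeps requires_grad=True.
for module in (encoder, llm):
    module.requires_grad_(False)

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# One toy training step on random data.
feats = torch.randn(2, 10, FEAT_DIM)        # fake speech features
text_ids = torch.randint(0, VOCAB, (2, 5))  # fake transcript token ids

speech_embeds = projector(encoder(feats))                # (2, 10, LLM_DIM)
text_embeds = llm.embed(text_ids)                        # (2, 5, LLM_DIM)
inputs = torch.cat([speech_embeds, text_embeds], dim=1)  # (2, 15, LLM_DIM)
logits = llm(inputs)                                     # (2, 15, VOCAB)

# Next-token loss on the transcript only: position t predicts token t + 1,
# so the last speech position predicts the first transcript token.
n_speech = speech_embeds.size(1)
pred = logits[:, n_speech - 1 : -1, :]                   # (2, 5, VOCAB)
loss = nn.functional.cross_entropy(pred.reshape(-1, VOCAB), text_ids.reshape(-1))
loss.backward()
optimizer.step()

# Gradients exist only for the projector; frozen parameters have none.
frozen_grads = [p.grad for m in (encoder, llm) for p in m.parameters()]
```

Gradients still flow through the frozen LLM back to the projector, because freezing only stops parameter updates, not backpropagation through activations. This is why a small projector can be trained to map speech features into the LLM's input space while both large models stay untouched.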

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Prompting Large Language Models with Speech Recognition Abilities

Connecting Speech Encoder and Large Language Model for ASR

Using Large Language Model for End-to-End Chinese ASR and NER

Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

A Survey on Speech Large Language Models

A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario

AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM

MaLa-ASR: Multimedia-Assisted LLM-Based ASR

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text

Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

Advancing Multi-talker ASR Performance with Large Language Models

Leveraging Large Language Models for Exploiting ASR Uncertainty

Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions