Abstract: We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). Modern large language models (LLMs) are adept at performing various text generation tasks through zero-shot learning, prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to refine the output further. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and fed into the LLM along with a specific instruction. The decoder subsequently takes as input the LLM output to perform token predictions, combining acoustic information from the encoder and the powerful linguistic information provided by the LLM. Experimental results show that the proposed LLM-guided model achieves a relative gain of approximately 13% in word error rates across major benchmarks.
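
The first two steps described in the abstract (obtaining a hypothesis by CTC decoding, then pairing it with an instruction) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the blank symbol, the word-level frame labels, the function names, and the instruction wording are all assumptions made for the example. The LLM forward pass and the attention decoder that consumes the LLM-derived representations are omitted.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed blank symbol; real systems typically use a token id


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


def build_prompt(hypothesis, instruction=None):
    """Pair an ASR hypothesis with an instruction (the wording is hypothetical)."""
    if instruction is None:
        instruction = (
            "Correct the grammatical errors in the following "
            "speech recognition hypothesis."
        )
    return f"{instruction}\n\nHypothesis: {hypothesis}\nCorrected:"


# Toy per-frame best labels from the encoder's CTC output (made up for illustration).
frames = ["<blank>", "i", "i", "<blank>", "has", "has",
          "<blank>", "a", "cat", "cat", "<blank>"]

hypothesis = " ".join(ctc_greedy_decode(frames))
prompt = build_prompt(hypothesis)
print(prompt)
```

In the model described above, the prompt would be fed to the LLM and the decoder would read the LLM's output representations rather than generated text; that part depends on the specific LLM and decoder and is left out here.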

What problem does this paper attempt to address?

The paper aims to improve end-to-end automatic speech recognition (ASR) by using an instruction-tuned large language model (LLM) to guide the text generation process. Specifically, the authors use the LLM as a front-end feature extractor for the decoder and design a prompting strategy that instructs the LLM to correct grammatical errors in the ASR hypothesis, so that the decoder can refine the final output.

### Main Contributions of the Paper

1. **Proposed a new model architecture**: The model is built on the joint CTC and attention architecture and strengthens language modeling by introducing an LLM as a front-end feature extractor for the decoder.
2. **Designed effective prompting strategies**: Given a specific instruction, the LLM is prompted to correct grammatical errors in the ASR hypothesis.
3. **Experimentally validated the model's effectiveness**: Experiments on major benchmarks show a relative word error rate (WER) reduction of approximately 13%.

### Key Technical Points

- **Joint CTC and attention mechanisms**: Combines the advantages of CTC and attention-based decoding, improving the model's robustness and accuracy.
- **LLM as a feature extractor**: Uses a pre-trained LLM (such as Llama2) as the front-end of the decoder to extract rich linguistic features.
- **Zero-shot learning**: With a suitably designed prompt, the LLM performs grammatical error correction without task-specific training data.
- **Cross-attention mechanism**: Aligns the linguistic features extracted by the LLM with the acoustic information extracted by the encoder, reducing hallucinations and over-corrections.

### Experimental Results

- **Performance on multiple datasets**: On datasets including LibriSpeech, TED-LIUM2, and CoVoST2, the model outperforms the baseline models in word error rate.
- **Ablation studies**: Ablation studies support the importance of the LLM and the prompting strategy, and thus the model design.

### Conclusion

The paper proposes a method for guiding end-to-end automatic speech recognition with an instruction-tuned large language model. By using the LLM as a front-end feature extractor for the decoder and designing an effective prompting strategy, the model improves performance on multiple benchmark datasets. Future research can explore lightweight, efficient LLMs to reduce computational cost.
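
For readers unfamiliar with the metric behind the reported results, the following Python sketch shows how word error rate and a relative gain are commonly computed. It is a generic illustration under stated assumptions: the function names are hypothetical, the sentences and WER values are invented for the example, and none of the numbers are taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_gain(baseline_wer, new_wer):
    """Relative WER reduction of a new system over a baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only, not results from the paper.
print(wer("i have a cat", "i has a cat"))
print(relative_gain(10.0, 8.7))
```

Under this definition, a relative gain of about 13% means the new system's WER is roughly 0.87 times the baseline's, regardless of the absolute WER level.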

Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition
