Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

Yosuke Higuchi,Tetsuji Ogawa,Tetsunori Kobayashi
2024-09-30
Abstract:We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). Modern large language models (LLMs) are adept at performing various text generation tasks through zero-shot learning, prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to refine the output further. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and fed into the LLM along with a specific instruction. The decoder subsequently takes as input the LLM output to perform token predictions, combining acoustic information from the encoder and the powerful linguistic information provided by the LLM. Experimental results show that the proposed LLM-guided model achieves a relative gain of approximately 13\% in word error rates across major benchmarks.
Audio and Speech Processing,Computation and Language,Sound
What problem does this paper attempt to address?
The problem this paper attempts to address is improving the performance of end-to-end automatic speech recognition (ASR) by leveraging large language models (LLMs) fine-tuned with instructions to guide the text generation process in ASR. Specifically, the authors propose a method that uses LLMs as front-end feature extractors for the decoder and designs effective prompting strategies to enable LLMs to correct grammatical errors in ASR hypotheses, thereby optimizing the final output. ### Main Contributions of the Paper 1. **Proposed a new model architecture**: This model is based on joint CTC and attention mechanisms, enhancing language modeling capabilities by introducing LLMs as front-end feature extractors for the decoder. 2. **Designed effective prompting strategies**: Through specific prompts, LLMs can effectively correct grammatical errors in ASR hypotheses. 3. **Experimentally validated the model's effectiveness**: Experiments conducted on multiple benchmark datasets showed significant relative improvements in word error rate (WER), with gains of up to 13%. ### Key Technical Points - **Joint CTC and attention mechanisms**: Combines the advantages of CTC and attention mechanisms, improving the model's robustness and accuracy. - **LLM as a feature extractor**: Utilizes pre-trained LLMs (such as Llama2) as the front-end of the decoder to extract rich language features. - **Zero-shot learning**: By designing specific prompts, LLMs can perform grammatical error correction tasks without task-specific training data. - **Cross-attention mechanism**: Aligns the language features extracted by LLMs with the speech information extracted by the encoder, reducing hallucinations and over-corrections. ### Experimental Results - **Performance on multiple datasets**: Including datasets like LibriSpeech, TED-LIUM2, and CoVoST2, experimental results show that the model significantly outperforms baseline models in terms of word error rate. - **Ablation studies**: Ablation studies validated the importance of LLMs and prompting strategies, further proving the effectiveness of the model design. ### Conclusion This paper proposes a new method for guiding end-to-end automatic speech recognition using large language models fine-tuned with instructions. By using LLMs as front-end feature extractors for the decoder and designing effective prompting strategies, the model achieves significant performance improvements on multiple benchmark datasets. Future research can explore the use of lightweight, efficient LLMs to reduce computational costs.