Abstract: Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade under adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLMs), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER faces challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and multi-task learning for simultaneous ASR and accent recognition (AR) has proven effective in multi-accent scenarios, making it a prominent solution. In this work, we propose a unified ASR-AR GER model, named MMGER, that leverages multi-modal correction and multi-granularity correction. Multi-task ASR-AR learning is employed to provide dynamic 1-best hypotheses and accent embeddings. Multi-modal correction accomplishes fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements the global linguistic information by incorporating regular 1-best hypotheses atop fine-grained multi-modal correction to achieve coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction to multi-accent scenarios. Experiments conducted on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER, achieving a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate, compared to a well-established standard baseline.
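The abstract reports results as a relative reduction in character error rate (CER). As a minimal sketch of how such a figure is typically computed, assuming the common definition of CER as character-level edit distance divided by reference length, the following may help. The function names and the example strings are illustrative assumptions, not taken from the paper, and no absolute CER values from the paper are reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level CER: total character edits over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error rate of `baseline`."""
    return (baseline - system) / baseline


# Hypothetical example: one substituted character in a six-character reference.
example_cer = cer(["今天天气很好"], ["今天天汽很好"])
```

Under this definition, a system that lowers CER from a hypothetical 10% to 8% achieves a 20% relative reduction, even though the absolute reduction is only 2 percentage points.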

Multi-stage Large Language Model Correction for Speech Recognition

ASR Error Correction using Large Language Models

Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

Correction Focused Language Model Training for Speech Recognition

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Generative error correction for code-switching speech recognition using large language models

Leveraging Large Language Models for Exploiting ASR Uncertainty

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction

Can Generative Large Language Models Perform ASR Error Correction?

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study

Speech Recognition Rescoring with Large Speech-Text Foundation Models

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition

Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Large-scale Language Model Rescoring on Long-form Data

Towards interfacing large language models with ASR systems using confidence measures and prompting

Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models