Abstract: Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade under adverse conditions. Generative error correction (GER) leverages the strong text comprehension capabilities of large language models (LLMs) and delivers impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER faces challenges such as fixed N-best hypotheses, insufficient use of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and multi-task learning for simultaneous ASR and accent recognition (AR) has proven effective in multi-accent scenarios, making it a prominent solution. In this work, we propose a unified ASR-AR GER model, named MMGER, that leverages multi-modal correction and multi-granularity correction. Multi-task ASR-AR learning is employed to provide dynamic 1-best hypotheses and accent embeddings. Multi-modal correction accomplishes fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements global linguistic information by incorporating regular 1-best hypotheses atop the fine-grained multi-modal correction, achieving coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction to multi-accent scenarios. Experiments conducted on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER, which achieves a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate compared to a well-established baseline.
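The two relative figures quoted in the abstract can be read as follows. This is a minimal sketch, assuming the standard definition of character error rate (Levenshtein edit distance divided by reference length) and the usual definition of relative change; the function names and the toy strings are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edit errors over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction of an error metric such as CER (lower is better)."""
    return (baseline - system) / baseline


def relative_improvement(baseline: float, system: float) -> float:
    """Relative improvement of a score such as AR accuracy (higher is better)."""
    return (system - baseline) / baseline


# Toy example (illustrative strings only): one substitution in four characters.
toy_cer = cer(["abcd"], ["abed"])
```

Under these definitions, a relative CER reduction compares the baseline CER with the system CER and normalizes by the baseline, while a relative accuracy improvement normalizes the accuracy gain by the baseline accuracy.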

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition