Abstract: Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the accuracy and robustness of speech recognition systems in challenging acoustic environments by drawing on visual cues. In this paper, we present a novel audio-visual speech recognition architecture with unified cross-modal attention. Our approach concatenates the sequences from the different modalities along the time axis and encodes the fused sequence in a unified feature space using a shared Conformer encoder. We then explicitly model additive noise and potential out-of-sync samples during training, and propose an auxiliary asynchronization-aware loss to improve system performance on out-of-sync data. To enhance the efficacy of unified cross-modal attention, a manual attention alignment strategy is designed and applied to the model, bringing additional gains in both recognition accuracy and computation cost. As demonstrated by experiments on the large-scale audio-visual LRS3 dataset, our proposed approach reduces the word error rate (WER) by a relative 50% compared to the audio-only single-modal ASR system under noisy conditions, and by a relative 25% compared to the previous audio-visual ASR baseline. The proposed audio-visual ASR system also shows superior robustness in more challenging conditions, such as audio-only data, visual corruption, audio-visual misalignment, and multi-talker interference. Moreover, the proposed Unified Cross-Modal Attention model exhibits a more general ability in multi-modality fusion, allowing additional modalities to be integrated easily within this framework to achieve a more accurate, robust, and safer multi-modal system.
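The temporal concatenation described in the abstract can be illustrated with a minimal sketch. This is not the paper's model: it replaces the shared Conformer encoder with a single self-attention step that has no learned projections, and it assumes both modalities have already been mapped to the same feature width. The function names `softmax` and `unified_attention` and the toy feature values are hypothetical, chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def unified_attention(audio, visual):
    """Self-attention over the temporally concatenated sequence.

    audio:  list of T_a frames, each a list of d floats
    visual: list of T_v frames, each a list of d floats
    Assumption: queries, keys and values are the raw frames
    (no learned projections), so this only shows the data flow.
    """
    fused = audio + visual          # concatenate along the time axis
    d = len(fused[0])
    weights = []
    for q in fused:
        # Scaled dot-product scores against every frame of both modalities
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in fused]
        weights.append(softmax(scores))
    # Each output frame is a weighted mix of audio and visual frames
    out = [[sum(w * v[j] for w, v in zip(row, fused)) for j in range(d)]
           for row in weights]
    return out, weights

# Toy inputs: 4 audio frames and 2 visual frames, feature width 2
audio = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4]]
visual = [[0.2, 0.2], [0.5, 0.1]]
out, weights = unified_attention(audio, visual)
```

Because every frame attends over the whole fused sequence, an audio frame can place weight on visual frames and vice versa, which is the sense in which a single attention map covers both modalities in this sketch.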

Cross-utterance ASR Rescoring with Graph-based Label Propagation

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Speech Recognition Rescoring with Large Speech-Text Foundation Models

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

Cross-utterance Reranking Models with BERT and Graph Convolutional Networks for Conversational Speech Recognition

Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR

Transformer-based Model for ASR N-Best Rescoring and Rewriting

Combining Hybrid DNN-HMM ASR Systems with Attention-Based Models Using Lattice Rescoring

LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring

Cross-lingual Automatic Speech Recognition Exploiting Articulatory Features

ProGRes: Prompted Generative Rescoring on ASR n-Best

Crossmodal ASR Error Correction with Discrete Speech Units

Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition

Enhancing CTC-based speech recognition with diverse modeling units

Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR

Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model

Neural Lattice Search for Speech Recognition

Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models

Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond