Abstract: An automatic speech recognition (ASR) system is a key component of current speech-based systems. However, surrounding acoustic noise can severely degrade the performance of an ASR system. An appealing solution to this problem is to augment conventional audio-based ASR systems with visual features describing lip activity. This paper proposes a novel end-to-end, multitask learning (MTL), audiovisual ASR (AV-ASR) system. A key novelty of the approach is the use of MTL, where the primary task is AV-ASR and the secondary task is audiovisual voice activity detection (AV-VAD). We obtain a robust and accurate audiovisual system that generalizes across conditions. By detecting segments with speech activity, the AV-ASR performance improves, as its connectionist temporal classification (CTC) loss function can leverage the AV-VAD alignment information. Furthermore, the end-to-end system learns a discriminative high-level representation for both speech tasks from the raw audiovisual inputs, providing the flexibility to mine information directly from the data. The proposed architecture considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme. We evaluate the proposed approach on a large audiovisual corpus (over 60 hours) that contains different channel and environmental conditions, comparing the results with competitive single-task learning (STL) and MTL baselines. Although our main goal is to improve the performance of the ASR task, the experimental results show that the proposed approach achieves the best performance across all conditions for both speech tasks. In addition to state-of-the-art performance in AV-ASR, the proposed solution also provides valuable information about speech activity, solving two of the most important tasks in speech-based applications.
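As a rough illustration of the multitask objective described in the abstract, the sketch below combines a CTC loss for the primary AV-ASR task with a frame-level binary cross-entropy loss for the secondary AV-VAD task. The abstract does not state how the two losses are combined, so the weighted sum, the `vad_weight` parameter, and all function names here are assumptions made for illustration only. The actual system is an end-to-end neural network operating on raw audiovisual inputs, which this toy example does not attempt to reproduce.

```python
import math

def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs[t][k] is the log-probability of symbol k at frame t.
    """
    # Extended label sequence: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    # Initialise: a path may start with a blank or with the first label.
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank in between
            new[s] = log_sum_exp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or on the trailing blank.
    total = log_sum_exp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total

def vad_loss(speech_probs, labels):
    """Mean binary cross-entropy over frames (1 = speech, 0 = non-speech)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(speech_probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)

def multitask_loss(log_probs, target, speech_probs, vad_labels, vad_weight=0.5):
    """Assumed combination: primary CTC loss plus a weighted secondary VAD loss."""
    return ctc_loss(log_probs, target) + vad_weight * vad_loss(speech_probs, vad_labels)
```

In this sketch the two tasks interact only through the summed objective. In a shared network, minimizing such a sum would push the common representation to serve both tasks, which is one plausible reading of how the speech-activity information could help the CTC alignment.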

End-to-End Audiovisual Speech Recognition System with Multitask Learning