Abstract: An automatic speech recognition (ASR) system is a key component in current speech-based systems. However, surrounding acoustic noise can severely degrade the performance of an ASR system. An appealing solution is to augment conventional audio-based ASR systems with visual features describing lip activity. This paper proposes a novel end-to-end, multitask learning (MTL), audiovisual ASR (AV-ASR) system. A key novelty of the approach is the use of MTL, where the primary task is AV-ASR and the secondary task is audiovisual voice activity detection (AV-VAD). We obtain a robust and accurate audiovisual system that generalizes across conditions. By detecting segments with speech activity, the AV-ASR performance improves, as its connectionist temporal classification (CTC) loss function can leverage the alignment information from AV-VAD. Furthermore, the end-to-end system learns a discriminative high-level representation for both speech tasks directly from the raw audiovisual inputs, providing the flexibility to mine information from the data. The proposed architecture considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme. We evaluate the proposed approach on a large audiovisual corpus (over 60 hours), which contains different channel and environmental conditions, comparing the results with competitive single-task learning (STL) and MTL baselines. Although our main goal is to improve the performance of the ASR task, the experimental results show that the proposed approach achieves the best performance across all conditions for both speech tasks. In addition to state-of-the-art performance in AV-ASR, the proposed solution also provides valuable information about speech activity, solving two of the most important tasks in speech-based applications.

End-to-End Audiovisual Speech Recognition System with Multitask Learning
