Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated and validated through a series of comprehensive experiments using the MISP2021 and MISP2022 datasets. Our code is available at
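
The video-modality dropout mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a dropped frame is replaced by an all-zero feature vector and that frames are dropped independently with a fixed probability; the abstract specifies neither, and the names `drop_video_frames` and `drop_rate` are hypothetical.

```python
import random


def drop_video_frames(frames, drop_rate, rng=None):
    """Zero out each video frame independently with probability drop_rate.

    frames: list of per-frame feature vectors (lists of floats).
    Returns (new_frames, mask), where mask[t] is 1 if frame t was kept.
    Zero-filling is an assumption made for illustration only.
    """
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must be in [0, 1]")
    rng = rng or random.Random()
    new_frames, mask = [], []
    for frame in frames:
        # rng.random() is in [0, 1), so drop_rate=0 keeps every frame
        # and drop_rate=1 drops every frame.
        keep = rng.random() >= drop_rate
        new_frames.append(list(frame) if keep else [0.0] * len(frame))
        mask.append(1 if keep else 0)
    return new_frames, mask
```

Passing a seeded `random.Random` makes the dropped positions reproducible, which helps when comparing a model's output on complete and frame-dropped versions of the same utterance.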

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper investigates the modality bias that dropout introduces in audio-visual speech recognition (AVSR) systems when video frames are missing, and proposes a new method to address it.

#### Specific Problem Description:

1. **Existing Issues**: Current state-of-the-art AVSR systems are highly sensitive to missing video frames, sometimes performing worse than unimodal models. Applying dropout to the video modality improves robustness to missing video frames, but it also degrades performance on complete inputs.
2. **Research Objective**: The paper aims to explain this paradox from the perspective of modality bias and proposes a new framework that improves the robustness of AVSR systems without degrading performance on complete inputs.

#### Main Contributions:

1. **Analysis of Modality Bias Phenomenon**: Quantitative analysis shows that the modality bias caused by dropout manifests as a shift from a multimodal distribution to a unimodal distribution in the latent representation subspace.
2. **Modality Bias Hypothesis (MBH)**: The paper proposes the Modality Bias Hypothesis to systematically describe the relationship between modality bias and robustness to missing modalities in multimodal systems.
3. **Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD)**: A new framework uses complete data to distill hidden knowledge, preventing over-reliance on a single modality.
4. **Modality-Specific Adapter (MS-Adapter)**: When video is severely or completely missing, an adapter dynamically switches decision strategies.

The effectiveness of the proposed methods is validated through a series of experiments, achieving top performance on the MISP2021 and MISP2022 datasets.
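
The distillation idea behind MDA-KD can be sketched with a generic temperature-scaled KL-divergence loss. This is standard knowledge distillation, not the paper's exact objective: the summary states only that complete data is used to distill hidden knowledge. The sketch assumes a teacher that sees complete audio-visual input and a student that sees input with dropped video frames; `kd_loss`, `teacher_logits`, `student_logits`, and `temperature` are hypothetical names.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softmax distributions.

    Assumption for illustration: the teacher saw complete audio-visual
    input, while the student saw input with dropped video frames.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("logit vectors must have the same length")
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    # Sum over classes of p * (log p - log q).
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the student drifts away from it, which is the sense in which a distillation term can pull a dropout-trained model back toward its complete-input behavior.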

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Robust Audio-visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization

On Robustness to Missing Video for Audiovisual Speech Recognition

Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

Missingness-resilient Video-enhanced Multimodal Disfluency Detection

Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing

On Modality Bias Recognition and Reduction

Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

Modality Dropout for Improved Performance-driven Talking Faces

Learning Trimodal Relation for AVQA with Missing Modality

Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

Gradient-Guided Modality Decoupling for Missing-Modality Robustness

Redundancy-Adaptive Multimodal Learning for Imperfect Data

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts

Balanced Audiovisual Dataset for Imbalance Analysis

Multimodal Sentiment Analysis under Modality Deficiency with Prototype-Augmentation in Software Engineering

Exploring the Role of Audio in Video Captioning