A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Yusheng Dai, Hang Chen, Jun Du, Ruoyu Wang, Shihao Chen, Jiefeng Ma, Haotian Wang, Chin-Hui Lee
2024-03-07
Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated and validated through a series of comprehensive experiments using the MISP2021 and MISP2022 datasets. Our code is available at
Sound, Computer Vision and Pattern Recognition, Machine Learning, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper investigates the modality bias that dropout introduces in audio-visual speech recognition (AVSR) systems when video frames are missing, and proposes a new method to address it.

#### Specific Problem Description:

1. **Existing Issues**: Current state-of-the-art AVSR systems are highly sensitive to the absence of video frames, sometimes performing worse than unimodal models. Although applying dropout to the video modality can enhance the system's robustness to missing video frames, it also degrades performance on complete data inputs.
2. **Research Objective**: The paper aims to explain this paradoxical phenomenon from the perspective of modality decision bias and proposes a new framework to improve the robustness of AVSR systems while avoiding performance degradation on complete inputs.

#### Main Contributions:

1. **Analysis of Modality Bias Phenomenon**: Quantitative analysis shows that the modality bias caused by dropout manifests as a shift from a multimodal distribution to a unimodal distribution in the latent representation subspace.
2. **Modality Bias Hypothesis (MBH)**: The paper proposes the Modality Bias Hypothesis to systematically describe the relationship between modality bias and modality-missing robustness in multimodal systems.
3. **Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD)**: A new framework is proposed that utilizes complete data to distill hidden knowledge, preventing over-reliance on a single modality.
4. **Modality-Specific Adapter (MS-Adapter)**: In cases of severe or complete video loss, an adapter is used to dynamically switch decision strategies.

The effectiveness of the proposed methods is validated through a series of experiments, achieving top performance on the MISP2021 and MISP2022 datasets.
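
The two training-side ideas summarized above, dropout applied to video frames and distillation from a model that sees complete data, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`drop_video_frames`, `log_softmax`, `kd_loss`), the all-zero placeholder for a missing frame, and the use of a temperature-scaled KL divergence on output logits are all assumptions chosen for illustration. MDA-KD as described distills hidden knowledge, which may differ from this logit-level loss.

```python
import math
import random


def drop_video_frames(frames, drop_rate, rng):
    """Return a copy of `frames` where each frame is independently replaced
    by an all-zero placeholder with probability `drop_rate`.

    Assumption: a missing frame is represented by zeros."""
    dropped = []
    for frame in frames:
        if rng.random() < drop_rate:
            dropped.append([0.0] * len(frame))
        else:
            dropped.append(list(frame))
    return dropped


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between temperature-softened distributions.

    Assumption: the teacher sees complete input and the student sees
    input with dropped video frames."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    rng = random.Random(0)
    video = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy per-frame features
    print(drop_video_frames(video, drop_rate=0.5, rng=rng))
    print(kd_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], temperature=2.0))
```

In this sketch the distillation term is zero when the student's softened output matches the teacher's and grows as they diverge, which is the sense in which a teacher trained on complete data can discourage the student from leaning on one modality.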