Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

Sishuo Chen, Lei Li, Shuhuai Ren, Rundong Gao, Yuanxin Liu, Xiaohan Bi, Xu Sun, Lu Hou
2024-03-28
Abstract: Video paragraph captioning (VPC) involves generating detailed narratives for long videos, utilizing supportive modalities such as speech and event boundaries. However, the existing models are constrained by the assumption of constant availability of a single auxiliary modality, which is impractical given the diversity and unpredictable nature of real-world scenarios. To this end, we propose a Missing-Resistant framework MR-VPC that effectively harnesses all available auxiliary inputs and maintains resilience even in the absence of certain modalities. Under this framework, we propose the Multimodal VPC (MVPC) architecture integrating video, speech, and event boundary inputs in a unified manner to process various auxiliary inputs. Moreover, to fortify the model against incomplete data, we introduce DropAM, a data augmentation strategy that randomly omits auxiliary inputs, paired with DistillAM, a regularization target that distills knowledge from teacher models trained on modality-complete data, enabling efficient learning in modality-deficient environments. Through exhaustive experimentation on YouCook2 and ActivityNet Captions, MR-VPC has proven to deliver superior performance on modality-complete and modality-missing test data. This work highlights the significance of developing resilient VPC models and paves the way for more adaptive, robust multimodal video understanding.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the sensitivity of Video Paragraph Captioning (VPC) models to missing auxiliary modalities, such as speech transcripts and event boundaries. Existing models assume that the same single auxiliary modality is available during both training and testing, which often does not hold in real-world settings. The paper points out that this assumption leads to two main problems:

1. **Underuse of auxiliary modalities**: Because each model is trained with only one specific auxiliary modality, it cannot exploit other modalities that are available at test time. For example, some models cannot use speech transcripts, while others cannot use event boundaries.
2. **Vulnerability in noisy environments**: When the required auxiliary modality is missing or of low quality, model performance drops significantly, a situation that is common in practice. For example, a model that relies on Automatic Speech Recognition (ASR) text degrades sharply when that text is absent.

To address these problems, the paper proposes the Missing-Resistant framework (MR-VPC), which consists of the following parts:

- **Multimodal Video Paragraph Captioning (MVPC) architecture**: This architecture integrates video, speech transcripts, and event boundaries, and processes them in a unified text feature space.
- **Data augmentation strategy (DropAM)**: Auxiliary inputs are randomly dropped during training to simulate missing modalities, which reduces the model's dependence on them and improves its generalization in noisy environments.
- **Knowledge distillation strategy (DistillAM)**: Knowledge is distilled from a teacher model trained on modality-complete data, so that the student model can learn efficiently even when modalities are missing.
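The two training-time ideas above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names `asr_text` and `event_boundaries`, the function names, and the drop probability `p_drop` are hypothetical, and the soft-label KL term is a generic distillation objective used here as a stand-in, since the exact DistillAM loss is not specified in this summary.

```python
import math
import random

# Hypothetical names for the auxiliary inputs; the video itself is never dropped.
AUXILIARY_KEYS = ("asr_text", "event_boundaries")


def drop_auxiliary(sample, p_drop=0.5, rng=None):
    """DropAM-style augmentation (sketch): return a copy of `sample` in which
    each auxiliary input is independently replaced by None with probability
    `p_drop`. The input dict is left unmodified."""
    if rng is None:
        rng = random
    augmented = dict(sample)
    for key in AUXILIARY_KEYS:
        if key in augmented and rng.random() < p_drop:
            augmented[key] = None
    return augmented


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def distill_kl(teacher_logits, student_logits, temperature=1.0):
    """Generic soft-label distillation term (assumed, not from the paper):
    KL(teacher || student) over one token's vocabulary distribution."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    sample = {
        "video": [0.1, 0.2, 0.3],          # placeholder visual features
        "asr_text": "add the onions",       # speech transcript
        "event_boundaries": [(0.0, 4.5)],   # (start, end) in seconds
    }
    # The student sees the augmented sample; the teacher would see `sample`.
    print(drop_auxiliary(sample, p_drop=0.5, rng=random.Random(0)))
    print(distill_kl([2.0, 0.5, -1.0], [1.0, 1.0, 0.0]))
```

In a training loop built on this sketch, the teacher would score the modality-complete sample while the student scores the augmented one, and the distillation term would be added to the usual captioning loss. How the two terms are weighted is a design choice not covered here.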
In experiments on two benchmark datasets, YouCook2 and ActivityNet Captions, the paper shows that MR-VPC performs strongly on both modality-complete and modality-missing test data, demonstrating its adaptability and robustness in multimodal video understanding tasks.