Abstract: Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite promising progress, recent studies have revealed that current VideoQA models tend to over-rely on superficial correlations rooted in dataset bias while overlooking the key video content, which leads to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video is crucial for robust VideoQA but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and encourages the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in the datasets. Specifically, we design a self-supervised contrastive learning objective that contrasts positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is pulled closer to that of the intervened input produced by video perturbation. We expect the fused representation to focus on the global context of videos rather than on a few static keyframes. Moreover, we introduce a temporal order regularization that enforces the inherent sequential structure of videos in the video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization on the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets demonstrate the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
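To make two of the objectives named in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes an InfoNCE-style form for the contrastive objective and a KL term between the answer distributions predicted for the original and the perturbed video; the abstract specifies neither the exact loss form nor the KL direction. All function names, the temperature value, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw answer scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): low when the anchor is close to
    the positive and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def perturbation_invariance(logits_orig, logits_pert):
    """KL between answer distributions for the original and the perturbed
    video (assumed direction: original as the reference distribution)."""
    return kl_divergence(softmax(logits_orig), softmax(logits_pert))


# Toy fused representations (hypothetical values).
fused_original = [1.0, 0.0]
fused_intervened = [0.9, 0.1]   # treated as the positive in this sketch
fused_negative = [0.0, 1.0]

contrastive = info_nce(fused_original, fused_intervened, [fused_negative])
invariance = perturbation_invariance([2.0, 0.5, 0.1], [1.8, 0.6, 0.2])
print(contrastive, invariance)
```

In a real training loop these terms would be computed on batched tensors and added to the QA loss with weighting coefficients, which the abstract does not specify. The temporal order regularization is omitted here because the abstract gives no detail on its form.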

Video Question Answering Via Gradually Refined Attention over Appearance and Motion

Multichannel Attention Refinement for Video Question Answering

Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Frame Augmented Alternating Attention Network for Video Question Answering

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network

Video Question Answering Via Grounded Cross-Attention Network Learning

Video Question Answering via Attribute-Augmented Attention Network Learning

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Motion-Appearance Co-Memory Networks for Video Question Answering

Video Question Answering Using CLIP-Guided Visual-Text Attention

Memory Augmented Deep Recurrent Neural Network for Video Question Answering

Question-Guided Erasing-Based Spatiotemporal Attention Learning for Video Question Answering

Robust video question answering via contrastive cross-modality representation learning

Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

The Forgettable-Watcher Model for Video Question Answering

Video Question Answering: a Survey of Models and Datasets

Video Question Answering Via Hierarchical Dual-Level Attention Network Learning