Abstract: Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite promising progress, recent studies have revealed that current VideoQA models tend to over-rely on superficial correlations rooted in dataset bias while overlooking key video content, which leads to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video is crucial for robust VideoQA but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in the datasets. Specifically, we design a self-supervised contrastive learning objective that contrasts positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is constrained to be closer to that of the intervened input obtained by video perturbation. We expect the fused representation to focus more on the global context of videos than on a few static keyframes. Moreover, we introduce a temporal order regularization that enforces the inherent sequential structure of videos in the video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization on the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets show the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
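
To make two of the ingredients named in the abstract more concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss over fused representations and a Kullback-Leibler consistency term between answer distributions. It is not the authors' implementation. The function names, the temperature value, the use of cosine similarity, and the toy vectors are all assumptions made for illustration. The abstract does not specify how positive and negative pairs are constructed, so the sketch simply takes them as arguments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    The loss is small when `anchor` is more similar to `positive`
    than to every vector in `negatives`.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy fused representations (hypothetical values).
fused_original = [1.0, 0.0]
fused_positive = [0.9, 0.1]      # assumed positive view of the same input
fused_negatives = [[0.0, 1.0]]   # assumed negative view

contrastive_loss = info_nce(fused_original, fused_positive, fused_negatives)

# Toy answer logits for the original and the temporally perturbed video.
answers_original = softmax([2.0, 0.5, 0.1])
answers_perturbed = softmax([1.8, 0.6, 0.2])
invariance_loss = kl_divergence(answers_original, answers_perturbed)

print(contrastive_loss, invariance_loss)
```

In a training loop, terms like these would typically be added to the standard QA loss with weighting coefficients. The weights, and the temporal order regularization that the abstract also mentions, are left out here because the abstract gives no details about them.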

Robust video question answering via contrastive cross-modality representation learning
