Abstract: Video question answering (VideoQA) is a challenging video understanding task that requires comprehensively understanding multimodal information and accurately answering related questions. Most existing VideoQA models use Graph Neural Networks (GNNs) to capture temporal-spatial interactions between objects. Although these methods have achieved some success, we argue that they have two limitations: (i) they must stack multiple GNN layers to capture high-order relations between objects, which inevitably introduces irrelevant noise; (ii) they neglect the unique self-supervised signals in the high-order relational structures among multiple objects, which can facilitate more accurate QA. To this end, we propose a novel Multi-scale Self-supervised Hypergraph Contrastive Learning (MSHCL) framework for VideoQA. Specifically, we first segment the video at multiple temporal scales to obtain multiple frame groups. For each frame group, we design appearance and motion hyperedges based on node semantics to connect object nodes. In this way, we construct a multi-scale temporal-spatial hypergraph that directly captures high-order relations among multiple objects. The node features produced by hypergraph convolution are then fed into a Transformer to capture the global information of the input sequence. Second, we design a self-supervised hypergraph contrastive learning task based on node- and hyperedge-dropping data augmentation, together with an improved question-guided multimodal interaction module, to enhance the accuracy and robustness of the VideoQA model. Finally, extensive experiments on three benchmark datasets demonstrate the superiority of the proposed MSHCL over state-of-the-art methods.
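The pipeline summarized above (hypergraph convolution over object nodes, node- and hyperedge-dropping augmentation, and a contrastive objective between the two augmented views) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the layer or the loss, so the sketch assumes the widely used HGNN-style normalization with unit hyperedge weights and an InfoNCE-style loss. All function names, shapes, drop rates, and the temperature `tau` are hypothetical choices made for illustration.

```python
import numpy as np

def hypergraph_conv(X, H, W):
    """One HGNN-style layer (assumed form, not taken from the paper):
    ReLU(Dv^-1/2 H De^-1 H^T Dv^-1/2 X W), with unit hyperedge weights.
    X: (n_nodes, d_in) features, H: (n_nodes, n_edges) incidence matrix."""
    dv = H.sum(axis=1)  # node degrees
    de = H.sum(axis=0)  # hyperedge degrees
    # Isolated nodes and empty hyperedges get a zero scaling factor.
    dv_inv_sqrt = np.where(dv > 0, 1.0 / np.sqrt(np.maximum(dv, 1e-12)), 0.0)
    de_inv = np.where(de > 0, 1.0 / np.maximum(de, 1e-12), 0.0)
    A = (dv_inv_sqrt[:, None] * H) * de_inv[None, :]
    A = A @ (H.T * dv_inv_sqrt[None, :])
    return np.maximum(A @ X @ W, 0.0)

def drop_nodes(H, p, rng):
    """Node dropping: remove each node from all hyperedges with probability p."""
    keep = rng.random(H.shape[0]) >= p
    return H * keep[:, None]

def drop_hyperedges(H, p, rng):
    """Hyperedge dropping: remove each hyperedge with probability p."""
    keep = rng.random(H.shape[1]) >= p
    return H * keep[None, :]

def info_nce(Z1, Z2, tau=0.5):
    """InfoNCE-style loss: the same node in the two views is the positive
    pair; every other node in the second view acts as a negative."""
    Z1 = Z1 / (np.linalg.norm(Z1, axis=1, keepdims=True) + 1e-8)
    Z2 = Z2 / (np.linalg.norm(Z2, axis=1, keepdims=True) + 1e-8)
    logits = Z1 @ Z2.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy example: 6 object nodes, 3 hyperedges (e.g. appearance or motion groups).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
W = rng.normal(size=(4, 4))

view_a = hypergraph_conv(X, drop_nodes(H, 0.2, rng), W)
view_b = hypergraph_conv(X, drop_hyperedges(H, 0.2, rng), W)
loss = info_nce(view_a, view_b)
```

Because each hyperedge connects an arbitrary number of nodes, a single layer already mixes information across every object in a group, which is the property the abstract contrasts with stacking several pairwise GNN layers. In a full model the contrastive loss would presumably be combined with the answer-prediction loss; how the two are weighted is not stated in the abstract.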

A multi-scale self-supervised hypergraph contrastive learning framework for video question answering
