Abstract: Video question answering (VideoQA) is a challenging video understanding task that requires comprehensively understanding multimodal information and accurately answering related questions. Most existing VideoQA models use Graph Neural Networks (GNNs) to capture temporal-spatial interactions between objects. Although these models have achieved some success, we argue that current schemes have two limitations: (i) existing graph-based methods must stack multiple GNN layers to capture high-order relations between objects, which inevitably introduces irrelevant noise; (ii) they neglect the unique self-supervised signals in the high-order relational structures among multiple objects, which can facilitate more accurate QA. To this end, we propose a novel Multi-scale Self-supervised Hypergraph Contrastive Learning (MSHCL) framework for VideoQA. Specifically, we first segment the video along multiple temporal dimensions to obtain multiple frame groups. For each frame group, we design appearance and motion hyperedges based on node semantics to connect object nodes. In this way, we construct a multi-scale temporal-spatial hypergraph that directly captures high-order relations among multiple objects. Furthermore, the node features produced by hypergraph convolution are fed into a Transformer to capture the global information of the input sequence. Second, we design a self-supervised hypergraph contrastive learning task based on node- and hyperedge-dropping data augmentation, together with an improved question-guided multimodal interaction module, to enhance the accuracy and robustness of the VideoQA model. Finally, extensive experiments on three benchmark datasets demonstrate the superiority of the proposed MSHCL over state-of-the-art methods.
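
To make the pipeline described in the abstract more concrete, the following is a minimal NumPy sketch of hypergraph convolution followed by a contrastive objective over two augmented views. It is an illustration under stated assumptions, not the MSHCL implementation: the abstract does not specify the convolution or the loss, so the sketch assumes the standard normalized hypergraph convolution with unit hyperedge weights and an InfoNCE-style loss, and the incidence matrix is random rather than built from appearance and motion semantics. All function names (`hypergraph_conv`, `drop_nodes`, `drop_hyperedges`, `info_nce`) and parameters (`rate`, `tau`) are hypothetical.

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """Assumed convolution: ReLU(Dv^-1/2 H De^-1 H^T Dv^-1/2 X Theta), unit edge weights."""
    dv = H.sum(axis=1)  # node degrees (number of hyperedges per node)
    de = H.sum(axis=0)  # hyperedge degrees (number of nodes per hyperedge)
    dv_inv_sqrt = np.where(dv > 0, 1.0 / np.sqrt(np.maximum(dv, 1e-12)), 0.0)
    de_inv = np.where(de > 0, 1.0 / np.maximum(de, 1e-12), 0.0)
    A = (dv_inv_sqrt[:, None] * H) * de_inv[None, :]
    A = A @ (H.T * dv_inv_sqrt[None, :])
    return np.maximum(A @ X @ Theta, 0.0)

def drop_nodes(H, rate, rng):
    """Node dropping: detach a random subset of nodes from every hyperedge."""
    keep = rng.random(H.shape[0]) >= rate
    return H * keep[:, None]

def drop_hyperedges(H, rate, rng):
    """Hyperedge dropping: remove a random subset of hyperedges entirely."""
    keep = rng.random(H.shape[1]) >= rate
    return H * keep[None, :]

def info_nce(Z1, Z2, tau=0.5):
    """InfoNCE-style loss: the same node in both views is the positive pair."""
    Z1 = Z1 / (np.linalg.norm(Z1, axis=1, keepdims=True) + 1e-12)
    Z2 = Z2 / (np.linalg.norm(Z2, axis=1, keepdims=True) + 1e-12)
    logits = Z1 @ Z2.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy example: 6 object nodes, 4 hyperedges, 8-dimensional features.
rng = np.random.default_rng(0)
H = (rng.random((6, 4)) < 0.5).astype(float)  # random incidence matrix (assumption)
X = rng.normal(size=(6, 8))                   # placeholder object features
Theta = rng.normal(size=(8, 8)) * 0.1         # placeholder learnable weights

Z1 = hypergraph_conv(X, drop_nodes(H, 0.2, rng), Theta)
Z2 = hypergraph_conv(X, drop_hyperedges(H, 0.2, rng), Theta)
loss = info_nce(Z1, Z2)
print("contrastive loss:", loss)
```

Because a single hyperedge can join any number of object nodes, one convolution step already mixes information across the whole group, which is the property the abstract contrasts with stacking several pairwise GNN layers.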

Unified QA-aware Knowledge Graph Generation Based on Multi-modal Modeling

A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

Graph-Based Multi-Interaction Network for Video Question Answering

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning

UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph

Deep Video Understanding with Video-Language Model

A multi-scale self-supervised hypergraph contrastive learning framework for video question answering

A Universal Quaternion Hypergraph Network for Multimodal Video Question Answering

Hybrid Improvements in Multimodal Analysis for Deep Video Understanding

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Multi-object event graph representation learning for Video Question Answering

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Deep Relationship Analysis in Video with Multimodal Feature Fusion

Knowledge-Based Visual Question Answering in Videos

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

Event Graph Guided Compositional Spatial–Temporal Reasoning for Video Question Answering