Abstract: Video Question Answering (VideoQA) aims to comprehend intricate relationships, actions, and events within video content, as well as the inherent links between objects and scenes, to answer text-based questions accurately. Transferring knowledge from the cross-modal pre-trained model CLIP is a natural approach, but its dual-tower structure hinders fine-grained modality interaction, which makes direct application to VideoQA tasks difficult. To address this issue, we introduce a Language-Guided Visual Aggregation (LGVA) network. It employs CLIP as an effective feature extractor to obtain language-aligned visual features at different granularities and avoids resource-intensive video pre-training. The LGVA network progressively aggregates visual information in a bottom-up manner, at both the regional and temporal levels, to support accurate answer prediction. More specifically, it employs local cross-attention to combine pre-extracted question tokens and region embeddings, pinpointing the object of interest in the question. Then, graph attention is used to aggregate regions at the frame level and to integrate additional captions for enhanced detail. Following this, global cross-attention is used to merge sentence and frame-level embeddings, identifying the video segment relevant to the question. Finally, contrastive learning is applied to optimize the similarities between aggregated visual and answer embeddings, unifying upstream and downstream tasks. Our method conserves resources by avoiding large-scale video pre-training and achieves competitive performance on the NExT-QA, MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA datasets, even outperforming some end-to-end trained models. Our code is available at https://github.com/ecoxial2007/LGVA_VideoQA.
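
To make the last two steps of the abstract more concrete, the following is a minimal sketch of language-guided attention pooling over frame embeddings followed by a contrastive score against candidate answer embeddings. It is not the authors' implementation: the function names (`cross_attention_pool`, `answer_contrastive_loss`), the toy 3-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for illustration, and the learned projection matrices that a real cross-attention layer would use are omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def cross_attention_pool(query, keys):
    """Weight each visual embedding by its scaled dot product with the
    text query, then return the weighted sum and the attention weights.
    Hypothetical helper: no learned projections are applied."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    pooled = [sum(w * k[i] for w, k in zip(weights, keys))
              for i in range(len(query))]
    return pooled, weights

def answer_contrastive_loss(visual, answers, correct_index, temperature=0.07):
    """Cross-entropy over temperature-scaled cosine similarities between
    one aggregated visual embedding and each candidate answer embedding.
    The temperature is an assumed value, not taken from the paper."""
    v = l2_normalize(visual)
    sims = [dot(v, l2_normalize(a)) / temperature for a in answers]
    probs = softmax(sims)
    return -math.log(probs[correct_index]), probs

# Toy inputs (made up): one sentence-level question embedding,
# three frame-level embeddings, two candidate answer embeddings.
question = [1.0, 0.0, 0.0]
frames = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]
answers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

pooled, weights = cross_attention_pool(question, frames)
loss, probs = answer_contrastive_loss(pooled, answers, correct_index=0)
print("attention weights:", weights)
print("answer probabilities:", probs, "loss:", loss)
```

In this toy setup the frame most aligned with the question receives the largest attention weight, and the pooled embedding scores highest against the first answer. Minimizing the loss during training would push the aggregated visual embedding toward the correct answer embedding and away from the others, which is the sense in which the similarity objective mirrors CLIP-style pre-training.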

Multi-object event graph representation learning for Video Question Answering

Cross-Modal Reasoning with Event Correlation for Video Question Answering

Event Graph Guided Compositional Spatial–Temporal Reasoning for Video Question Answering

Multimodal Graph Reasoning and Fusion for Video Question Answering

Graph-Based Multi-Interaction Network for Video Question Answering

Video Question Answering Via Grounded Cross-Attention Network Learning

Learning Question-Guided Video Representation for Multi-Turn Video Question Answering

Top-down Activity Representation Learning for Video Question Answering

A multi-scale self-supervised hypergraph contrastive learning framework for video question answering

Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition

Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering

Multi-interaction Network with Object Relation for Video Question Answering

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

Contrastive Video Question Answering via Video Graph Transformer

Language-Guided Visual Aggregation Network for Video Question Answering

Location-Aware Graph Convolutional Networks for Video Question Answering

Joint Learning of Object Graph and Relation Graph for Visual Question Answering

Progressive Graph Attention Network for Video Question Answering

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering