Abstract: Long-term Video Question Answering plays an essential role in visual information retrieval; it aims to generate natural language answers to arbitrary free-form questions about a referenced long-term video. Rather than remembering the video as a sequence of visual content, humans have an innate cognitive ability to identify the critical moments related to the question at first glance and then tie together the specific evidence around these critical moments for further analysis and reasoning. Motivated by this intuition, we propose multimodal hierarchical memory attentive networks with two heterogeneous memory subnetworks: the top guided memory network and the bottom enhanced multimodal memory attentive network. The top guided memory network serves as a shallow inference engine that picks the moments most relevant and informative for the question and obtains salient video content at a coarse-grained level. Subsequently, the bottom enhanced multimodal memory attentive network serves as an in-depth reasoning engine that performs more accurate attention at a fine-grained level, using cues from the bottom-level video evidence, to enhance question answering quality. We evaluate the proposed method on three publicly available video question answering benchmarks, namely ActivityNet-QA, MSRVTT-QA, and MSVD-QA. Experimental results demonstrate that the proposed approach significantly outperforms other state-of-the-art methods for long-term videos. Extensive ablation studies are carried out to explore the reasons behind the proposed model’s effectiveness.
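The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: it omits the memory networks and all learned parameters, and simply assumes that the question and each frame are already embedded as plain vectors of equal length. The function name `coarse_to_fine_attention`, the mean-pooled segment scoring, and the `top_k` parameter are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_to_fine_attention(question, segments, top_k=2):
    """Hypothetical two-stage attention.

    question: question embedding (list of floats).
    segments: list of video segments, each a list of frame feature vectors.
    Returns (selected segment indices, frame weights, attended context).
    """
    dim = len(question)

    # Coarse stage (assumption): score each segment by its mean-pooled
    # frame feature and keep the top_k segments most similar to the question.
    seg_means = [
        [sum(frame[d] for frame in seg) / len(seg) for d in range(dim)]
        for seg in segments
    ]
    seg_scores = [dot(question, mean) for mean in seg_means]
    ranked = sorted(range(len(segments)), key=lambda i: seg_scores[i], reverse=True)
    selected = sorted(ranked[:top_k])  # restore temporal order

    # Fine stage (assumption): softmax attention over the individual frames
    # of the selected segments only.
    frames = [frame for i in selected for frame in segments[i]]
    weights = softmax([dot(question, f) for f in frames])
    context = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return selected, weights, context


if __name__ == "__main__":
    # Toy data: three segments of two 2-dimensional frame features each.
    q = [1.0, 0.0]
    video = [
        [[0.0, 1.0], [0.0, 1.0]],
        [[2.0, 0.0], [1.0, 0.0]],
        [[0.5, 0.0], [0.5, 0.0]],
    ]
    print(coarse_to_fine_attention(q, video, top_k=2))
```

Restricting the fine stage to the selected segments is what keeps the second pass cheap on long videos: the per-frame attention runs over only `top_k` segments rather than the whole sequence.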

Frame augmented alternating attention network for video question answering

Frame Augmented Alternating Attention Network for Video Question Answering.

Video question answering by frame attention

Video Question Answering via Attribute-Augmented Attention Network Learning

Initialized Frame Attention Networks for Video Question Answering.

Structured Two-stream Attention Network for Video Question Answering

Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

Feature Augmented Memory with Global Attention Network for VideoQA

Video question answering via gradually refined attention over appearance and motion

Video Question Answering Via Grounded Cross-Attention Network Learning.

Long-Term Video Question Answering Via Multimodal Hierarchical Memory Attentive Networks

Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

Hierarchical Recurrent Contextual Attention Network for Video Question Answering

Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering

Relation-aware Hierarchical Attention Framework for Video Question Answering

Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Multichannel Attention Refinement for Video Question Answering.

FHGN: Frame-Level Heterogeneous Graph Networks for Video Question Answering

Memory Augmented Deep Recurrent Neural Network for Video Question Answering