Abstract:<p>Video-based visual question answering (V-VQA) remains a challenging task at the intersection of vision and language. In this paper, we propose a novel architecture, the Generalized Pyramid Co-attention with Learnable Aggregation Net (GPC), to address two central problems: 1) how to apply co-attention to the V-VQA task given the complex and diverse content of videos; and 2) how to aggregate frame-level (or word-level) features without destroying the feature distributions and temporal information. To solve the first problem, we propose a Generalized Pyramid Co-attention structure with a novel diversity learning module that explicitly encourages attention accuracy and diversity. We first instantiate it as a Multi-path Pyramid Co-attention (MPC) module to capture diverse features. We then observe that each attention branch of the original co-attention mechanism does not interact with the others, which results in coarse attention maps. We therefore extend the MPC structure to a Cascaded Pyramid Transformer Co-attention (CPTC) module, in which we replace co-attention with transformer co-attention. To solve the second problem, we propose a new learnable aggregation method with a set of evidence gates. It aggregates adaptively weighted frame-level (or word-level) features to extract rich video (or question) context semantic information. With the evidence gates, it then selects the most relevant signals, which represent the evidence information, to predict the answer. Extensive validation on two V-VQA datasets, TGIF-QA and TVQA, shows that both MPC and CPTC achieve state-of-the-art performance, and that CPTC performs better under various settings and metrics. The code and model have been released at: <a href="https://github.com/lixiangpengcs/LAD-Net-for-VideoQA">https://github.com/lixiangpengcs/LAD-Net-for-VideoQA</a>.</p>
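The aggregation idea in the abstract (adaptively weighted frame-level features followed by gating) can be illustrated with a minimal sketch. This is not the released implementation: the function name `aggregate`, the linear scoring weights `score_w`, and the per-dimension gate weights `gate_w` are all hypothetical, and the gate here is a plain sigmoid chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scalar scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate(frames, score_w, gate_w):
    """Attention-weighted pooling followed by an element-wise gate.

    frames:  list of T feature vectors, each of dimension D
    score_w: D hypothetical weights used to score each frame
    gate_w:  D hypothetical weights for the per-dimension gate
    """
    # One scalar relevance score per frame (assumed linear scoring).
    scores = [sum(w * x for w, x in zip(score_w, f)) for f in frames]
    alphas = softmax(scores)  # adaptive weights that sum to 1
    dim = len(frames[0])
    # Weighted sum of frame features instead of plain averaging.
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    # Gate in (0, 1) per dimension, applied to the pooled feature.
    gates = [sigmoid(g * p) for g, p in zip(gate_w, pooled)]
    gated = [g * p for g, p in zip(gates, pooled)]
    return alphas, gated
```

With all-zero scoring weights every frame receives equal weight, so the pooling reduces to a mean; non-zero weights let some frames dominate the pooled feature, and the gate then scales each dimension of the result.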

TLNet: Temporal Span Localization Network with Collaborative Graph Reasoning for Video Question Answering

Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework

Question-Aware Tube-Switch Network for Video Question Answering

Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Structured Two-stream Attention Network for Video Question Answering

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

Dynamic Spatio-Temporal Modular Network for Video Question Answering

Video Question Answering Via Grounded Cross-Attention Network Learning

Discovering Spatio-Temporal Rationales for Video Question Answering

Boundary Proposal Network for Two-Stage Natural Language Video Localization

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios

Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

Generalized pyramid co-attention with learnable aggregation net for video question answering

Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network

Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Transformer-Empowered Invariant Grounding for Video Question Answering

Contrastive Video Question Answering via Video Graph Transformer

Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering

TransferNet: an Effective and Transparent Framework for Multi-hop Question Answering over Relation Graph