Abstract: Video Question Answering (VideoQA) requires a thorough comprehension of both linguistic and visual modalities. However, recent methods face two problems: (1) modeling object actions and frame scenes synchronously, rather than step by step, can better mine the latent semantic attributes of videos, yet it remains under-explored; (2) the relationships between cross-modal alignments at different granularities of abstraction are not fully exploited. Based on these insights, we propose a novel method, hierarchical synchronization with structured multi-granularity interaction (HSSMI), for VideoQA. First, a hierarchical synchronous reasoning module models objects’ relations and dynamics and, while analyzing whole frames, synchronously captures their synergistic influences over time. The module can be viewed as multiple Object ConvLSTMs (O-CLSTMs) individually or as a single Frame ConvLSTM (F-CLSTM) collectively. Specifically, each O-CLSTM learns object-level action states under neighboring spatial interactions. Meanwhile, the F-CLSTM learns the frame-level scene state, with action information from the O-CLSTMs selectively aggregated into a common memory cell of the F-CLSTM under the guidance of the question. In addition, a boundary detector discovers scene discontinuities, enabling the F-CLSTM to alter its temporal connectivity and adapt its sequential encoding process to each video. Next, we develop a conditional VLAD with topic constraints for discriminative modality summarization. Finally, a structured multi-granularity interaction module integrates complementary clues from the global alignment between the scene and the full question and the local alignments between action summaries and words. This module encourages useful information to pass through the compositional syntactic topologies of questions to predict answers. Experiments on three public benchmark datasets demonstrate the superiority of HSSMI over other state-of-the-art methods.
Code will be publicly available at https://github.com/Qiss33/HSSMI.
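To make the boundary-aware encoding idea concrete, the following is a minimal sketch and not the authors' F-CLSTM. It assumes frames are given as plain feature vectors, replaces the ConvLSTM with a toy decaying average, and uses a hypothetical cosine-distance threshold as the boundary detector. The function names `detect_boundaries` and `encode_with_resets` and the parameters `threshold` and `decay` are illustrative assumptions. It shows only how a detected scene discontinuity can cut the recurrent connection so that each scene segment is summarized separately.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two feature vectors (1.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def detect_boundaries(frames, threshold=0.5):
    """Hypothetical detector: flag frame t when it differs sharply from frame t-1."""
    flags = [0]  # the first frame never starts a new segment
    for prev, cur in zip(frames, frames[1:]):
        flags.append(1 if cosine_distance(prev, cur) > threshold else 0)
    return flags


def encode_with_resets(frames, flags, decay=0.5):
    """Toy recurrence standing in for a recurrent cell (assumption, not a ConvLSTM).

    The carried state is zeroed whenever a boundary is flagged, which is the
    "altered time connectivity" idea in its simplest form.
    """
    state = [0.0] * len(frames[0])
    segment_states = []
    for frame, flag in zip(frames, flags):
        if flag:
            segment_states.append(state)   # close the previous segment
            state = [0.0] * len(frame)     # cut the temporal connection
        state = [decay * s + (1 - decay) * x for s, x in zip(state, frame)]
    segment_states.append(state)           # close the final segment
    return segment_states


if __name__ == "__main__":
    # Two frames of one "scene" followed by two frames of another.
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    flags = detect_boundaries(frames)
    print(flags)
    print(encode_with_resets(frames, flags))
```

In the method described above, the boundary decision and the recurrent cell are learned components; the fixed threshold and the averaging update here are stand-ins chosen only to keep the sketch self-contained.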

Multi-Granularity Interaction and Integration Network for Video Question Answering

Detecting spamming activities by network monitoring with Bloom filters

Multi-interaction Network with Object Relation for Video Question Answering

Video Question Answering Via Grounded Cross-Attention Network Learning.

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Hierarchical synchronization with structured multi-granularity interaction for video question answering

Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Multichannel Attention Refinement for Video Question Answering.

Visual Question Generation Under Multi-granularity Cross-Modal Interaction.

Multi-Turn Video Question Answering Via Multi-Stream Hierarchical Attention Context Network

Multi-Turn Video Question Generation Via Reinforced Multi-Choice Attention Network

Bidirectional Signaling through EphrinA2-EphA2 Enhances Osteoclastogenesis and Suppresses Osteoblastogenesis*

Video Question Answering Via Gradually Refined Attention over Appearance and Motion

Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Modular Blended Attention Network for Video Question Answering

Learning Question-Guided Video Representation for Multi-Turn Video Question Answering

ReGR: Relation-aware graph reasoning framework for video question answering

Harnessing Representative Spatial-Temporal Information for Video Question Answering