Abstract: Video Question Answering (VideoQA) requires a thorough comprehension of both linguistic and visual modalities. However, recent methods face two problems: (1) modeling object actions and frame scenes synchronously, rather than step by step, could better mine the latent semantic attributes of videos, yet it remains under-explored; (2) the relationship between cross-modal alignments at different granularities of abstraction is not fully utilized. Based on these insights, we propose a novel method named hierarchical synchronization with structured multi-granularity interaction (HSSMI) for VideoQA. First, a hierarchical synchronous reasoning module models objects’ relations and dynamics and synchronously captures their synergistic influences over time when analyzing whole frames. It can be viewed as multiple Object ConvLSTMs (O-CLSTMs) operating in isolation or as a single Frame ConvLSTM (F-CLSTM) operating collectively. Specifically, each O-CLSTM learns object-level action states under neighboring spatial interplays. Meanwhile, the F-CLSTM learns the frame-level scene state, where action information from the O-CLSTMs is selectively aggregated into a common memory cell of the F-CLSTM as instructed by the question. In addition, a boundary detector discovers scene discontinuities, enabling the F-CLSTM to alter its temporal connectivity and adapt its sequential encoding process to each video. Next, we develop a conditional VLAD with topic constraints for discriminative modality summarization. Finally, a structured multi-granularity interaction module integrates complementary clues from the global alignment between the scene and the full question and from the local alignments between action summaries and words. This module encourages useful information to pass through the compositional syntactic topologies of questions to predict answers. Experiments on three public benchmark datasets demonstrate the superiority of HSSMI over other state-of-the-art methods. Code will be publicly available at https://github.com/Qiss33/HSSMI.
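
The boundary-aware encoding mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the distance-threshold detector, the scalar weights `w_x` and `w_h`, and the function names `detect_boundaries` and `encode` are all assumptions made for illustration, and a plain tanh recurrence stands in for the F-CLSTM. The sketch only shows one plausible reading of "altering temporal connectivity": when a scene discontinuity is flagged, the recurrent state is reset so that the next segment is encoded independently of the previous one.

```python
import math

def detect_boundaries(frames, threshold):
    """Hypothetical detector: flag frame t when its features are far from frame t-1."""
    flags = [False]  # the first frame has no predecessor, so it is never a boundary
    for prev, cur in zip(frames, frames[1:]):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, cur)))
        flags.append(dist > threshold)
    return flags

def encode(frames, boundaries, w_x=0.5, w_h=0.5):
    """Toy recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1}); h_{t-1} is zeroed at boundaries.

    Returns one final state per detected segment.
    """
    dim = len(frames[0])
    h = [0.0] * dim
    segment_states = []
    for x, is_boundary in zip(frames, boundaries):
        if is_boundary:
            segment_states.append(h)  # keep the state summarizing the finished segment
            h = [0.0] * dim           # cut the temporal connection (assumed reset rule)
        h = [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h)]
    segment_states.append(h)
    return segment_states

# Four toy frame features: two similar frames, then an abrupt scene change.
frames = [[0.1, 0.2], [0.12, 0.19], [0.9, 0.8], [0.88, 0.82]]
boundaries = detect_boundaries(frames, threshold=0.5)
segments = encode(frames, boundaries)
```

With these toy inputs the third frame is flagged as a boundary, so `encode` returns two segment states, and the second one depends only on the last two frames. A learned detector and gated ConvLSTM cells would replace both toy components in a real system.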

From Coarse to Fine: Hierarchical Structure-aware Video Summarization

Hierarchical organization for medical video summarization using latent visual and semantic analysis

An Unsupervised Video Summarization Method Based on Multimodal Representation

A GAN Based Video Summarization Method with Representation Loss

Update Summarization using a Multi-level Hierarchical Dirichlet Process Model

Learning Multiscale Hierarchical Attention for Video Summarization

Hierarchical multi‐modal video summarization with dynamic sampling

Learning Hierarchical Video Representation for Action Recognition

Action Parsing-Driven Video Summarization Based on Reinforcement Learning

User-Ranking Video Summarization with Multi-Stage Spatio-Temporal Representation

Automatically Generating Hierarchical Summary for Film Video

Video Joint Modelling Based on Hierarchical Transformer for Co-summarization

Abstractive Summarization Guided by Latent Hierarchical Document Structure

Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions

Category Driven Deep Recurrent Neural Network for Video Summarization

A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

Video Summarization with Long Short-term Memory

A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining

Progressive Reinforcement Learning for Video Summarization

Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description

Hierarchical synchronization with structured multi-granularity interaction for video question answering