Complementary Spatiotemporal Network for Video Question Answering.

Li Xinrui, Wu Aming, Han Yahong
DOI: https://doi.org/10.1007/s00530-021-00805-6
IF: 3.9
2021-01-01
Multimedia Systems
Abstract: Video question answering (VideoQA) is challenging because it requires models to capture motion and spatial semantics and to associate them with linguistic context. Recent methods usually treat space and time symmetrically. Because spatial structures and temporal events often change at different speeds in a video, these methods have difficulty distinguishing spatial details and motion relationships at different scales. To this end, we propose a complementary spatiotemporal network (CST) that focuses on multi-scale motion relationships and essential spatial semantics. Our model involves three modules. First, a multi-scale relation (MR) unit captures temporal information by modeling different distances between motions. Second, a mask similarity (MS) operation captures discriminative spatial semantics in a less redundant manner. Third, cross-modality attention (CMA) boosts the interaction between different modalities. We evaluate our method on three benchmark datasets and conduct extensive ablation studies. The performance improvement demonstrates the effectiveness of our approach.