Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Zheng-Jun Zha, Jiawei Liu, Tianhao Yang, Yongdong Zhang
DOI: https://doi.org/10.1145/3320061
2019-01-01
Abstract: Visual Question Answering (VQA) aims to provide a natural language answer given an image or video and a natural language question. Despite recent progress on VQA, existing works primarily focus on image question answering and are suboptimal for video question answering. This article presents a novel Spatiotemporal-Textual Co-Attention Network (STCA-Net) for video question answering. The STCA-Net jointly learns spatial and temporal visual attention on videos as well as textual attention on questions. It concentrates on the essential cues in both the visual and textual spaces for answering the question, leading to an effective question-video representation. In particular, a question-guided attention network is designed to learn a question-aware video representation with a spatial-temporal attention module, which focuses the network on regions of interest within the frames of interest across the entire video. A video-guided attention network is proposed to learn a video-aware question representation with a textual attention module, leading to a fine-grained understanding of the question. The learned video and question representations are used by an answer predictor to generate answers. Extensive experiments on two challenging video question answering datasets, MSVD-QA and MSRVTT-QA, demonstrate the effectiveness of the proposed approach.
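The co-attention idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the STCA-Net implementation: the abstract does not specify the scoring functions, feature extractors, or answer predictor, so the dot-product scoring, the mean-pooled summaries used as guidance vectors, the random features, and all function and variable names below are assumptions chosen only to show the data flow (spatial attention over regions, temporal attention over frames, textual attention over words).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_video(video, q_vec):
    """video: (T, R, D) frame-region features; q_vec: (D,) question summary.
    Dot-product scoring is an assumption, not the paper's scoring function."""
    # Spatial attention: weight regions within each frame by similarity to the question.
    spatial_w = softmax(video @ q_vec, axis=1)              # (T, R)
    frames = (spatial_w[..., None] * video).sum(axis=1)     # (T, D)
    # Temporal attention: weight the attended frames by similarity to the question.
    temporal_w = softmax(frames @ q_vec, axis=0)            # (T,)
    attended = (temporal_w[:, None] * frames).sum(axis=0)   # (D,)
    return attended, spatial_w, temporal_w

def video_guided_question(words, v_vec):
    """words: (N, D) word features; v_vec: (D,) video summary."""
    # Textual attention: weight words by similarity to the video.
    word_w = softmax(words @ v_vec, axis=0)                 # (N,)
    return (word_w[:, None] * words).sum(axis=0), word_w

# Hypothetical sizes and random features stand in for real CNN/RNN outputs.
rng = np.random.default_rng(0)
T, R, N, D = 4, 6, 5, 8
video = rng.normal(size=(T, R, D))
words = rng.normal(size=(N, D))

# Mean pooling as the guidance vectors is an assumption made for this sketch.
q_summary = words.mean(axis=0)
v_summary = video.mean(axis=(0, 1))

v_att, spatial_w, temporal_w = question_guided_video(video, q_summary)
q_att, word_w = video_guided_question(words, v_summary)

# The joint representation would be passed to an answer predictor.
joint = np.concatenate([v_att, q_att])
```

In this sketch the spatial weights sum to one within each frame and the temporal weights sum to one across frames, so the attended video vector is a convex combination of region features, which mirrors the "regions of interest within the frames of interest" description.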