Initialized Frame Attention Networks for Video Question Answering.

Kun Gao, Xianglei Zhu, Yahong Han
DOI: https://doi.org/10.1007/978-981-10-8530-7_34
2017-01-01
Abstract: Video Question Answering (Video QA) is an important and challenging problem in multimedia and computer vision research. In this paper, we propose a novel framework called initialized frame attention networks (IFAN). The framework uses long short-term memory (LSTM) networks to encode the visual information of a video and then initializes the language model with the encoded features. An answer is then produced from the combined visual and semantic features. In particular, IFAN integrates a temporal attention mechanism that focuses on the salient frames of the video, that is, the frames associated with the question. To verify the effectiveness of the proposed framework, we conduct experiments on the TACoS dataset. IFAN achieves good performance on both the hard and easy levels of the TACoS dataset.
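The temporal attention step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how frames are scored against the question, so the dot-product scoring used here is an assumption, as are the function names `softmax` and `temporal_attention` and the toy feature vectors; this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, question_vector):
    """Weight each frame by its relevance to the question.

    Assumption: relevance is a plain dot product between a frame's
    feature vector and the question vector. The paper may use a
    different (for example, learned) scoring function.
    """
    scores = [
        sum(f * q for f, q in zip(frame, question_vector))
        for frame in frame_features
    ]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # The attended feature is the weighted sum of frame features over time.
    context = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical toy input: three frames with 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]
weights, context = temporal_attention(frames, question)
```

In this toy input the first frame aligns most closely with the question vector, so it receives the largest weight and dominates the attended feature. In a full model, the frame features would come from the LSTM video encoder and the question vector from the language model.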