Video Question Answering Using a Forget Memory Network

Yuanyuan Ge, Youjiang Xu, Yahong Han
DOI: https://doi.org/10.1007/978-981-10-7299-4_33
2017-01-01
Abstract: Visual question answering combines computer vision and natural language processing, and it has received much attention in recent years. Image question answering (Image QA) aims to automatically answer questions about the visual content of an image. Unlike Image QA, video question answering (Video QA) must explore a sequence of images to answer the question, and it is difficult to focus on the local region features related to the question across that sequence. In this paper, we propose a forget memory network (FMN) for Video QA to solve this problem. When the forget memory network embeds the video frame features, it selects the local region features that are related to the question and forgets the features that are irrelevant to it. We then use the embedded video and question features to predict the answer from the multiple-choice candidates. Our proposed approaches achieve good performance on the MovieQA [21] and TACoS [28] datasets.
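The select-and-forget idea in the abstract can be illustrated with a small sketch. This is not the authors' FMN architecture, which the abstract does not specify; it is a minimal stand-in that assumes a sigmoid gate on the dot product between each region feature and the question vector, mean pooling over frames, and dot-product scoring of candidate answers. All function names (`embed_frame`, `embed_video`, `pick_answer`) and the toy vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def embed_frame(regions, question):
    """Weight each region feature by a soft gate and sum the results.

    Assumption: the gate is sigmoid(region . question), so regions aligned
    with the question are kept and regions opposed to it are suppressed.
    """
    gates = [sigmoid(dot(r, question)) for r in regions]
    dim = len(question)
    return [sum(g * r[i] for g, r in zip(gates, regions)) for i in range(dim)]


def embed_video(frames, question):
    """Average the gated frame embeddings over all frames (assumed pooling)."""
    embedded = [embed_frame(regions, question) for regions in frames]
    dim = len(question)
    return [sum(e[i] for e in embedded) / len(embedded) for i in range(dim)]


def pick_answer(frames, question, answers):
    """Return the index of the candidate answer with the highest score."""
    video = embed_video(frames, question)
    # Assumed fusion: element-wise sum of video and question embeddings.
    context = [v + q for v, q in zip(video, question)]
    scores = [dot(context, a) for a in answers]
    return max(range(len(answers)), key=scores.__getitem__)


# Toy input: two frames, each with two 2-dimensional region features.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
question = [5.0, -5.0]
answers = [[0.0, 1.0], [1.0, 0.0]]
best = pick_answer(frames, question, answers)
```

In this toy input the first region of each frame is aligned with the question and the second is opposed to it, so the gate keeps the former and nearly zeroes out the latter. A learned model would replace the raw dot product with trained projections.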