Two-step Joint Attention Network for Visual Question Answering

Weiming Zhang, Chunhong Zhang, Pei Liu, Zhiqiang Zhan, Xiaofeng Qiu
DOI: https://doi.org/10.1109/bigcom.2017.17
2017-01-01
Abstract: Visual Question Answering (VQA) is the task of automatically answering natural language questions about the content of a reference image. A common approach to VQA is to extract an image feature and a question feature with deep neural networks and then combine the two features through an attention mechanism to predict the answer. Most attention methods for VQA consider only which local regions of the image are relevant to the answer and ignore the fact that the words of the question carry different weights for the answer. Hence, we propose two-step joint attention, which uses the combined representation of the image feature and the question feature to guide both visual attention and question attention. Two-step joint attention focuses on the given image and question gradually, from coarse-grained parts to fine-grained parts, to predict the answer. To extract image features precisely, we also propose a BiSRU and use an RNN based on BiSRU so that the vectors of adjacent local regions of the image retain information about each other. We demonstrate and analyze the effectiveness of the approach on the VQA dataset and use visualization to show the results intuitively.
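The idea of letting a combined image-question representation guide both attentions over two passes can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify the scoring function or how the features are fused, so the scaled dot-product scores, the element-wise product used as the joint representation, the mean-pooled starting vectors, the random toy features, and all function names below are assumptions made purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(features):
    """Average a list of equal-length vectors (a coarse summary)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def attend(features, guide):
    """Weight each feature vector by its scaled dot product with the guide.

    Assumption: dot-product scoring; the paper may use a learned scorer.
    """
    dim = len(guide)
    scores = [
        sum(f * g for f, g in zip(vec, guide)) / math.sqrt(dim)
        for vec in features
    ]
    weights = softmax(scores)
    attended = [
        sum(w * vec[i] for w, vec in zip(weights, features))
        for i in range(dim)
    ]
    return attended, weights


def two_step_joint_attention(regions, words, steps=2):
    """Hypothetical sketch: the joint vector guides both attentions."""
    v = mean_vector(regions)  # coarse image summary
    q = mean_vector(words)    # coarse question summary
    for _ in range(steps):
        # Assumption: element-wise product as the joint representation.
        joint = [a * b for a, b in zip(v, q)]
        v, region_w = attend(regions, joint)  # visual attention
        q, word_w = attend(words, joint)      # question attention
    return [a * b for a, b in zip(v, q)], region_w, word_w


# Toy input: 49 image regions and 8 question words, 16-dim features.
random.seed(0)
regions = [[random.gauss(0, 1) for _ in range(16)] for _ in range(49)]
words = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
joint, region_w, word_w = two_step_joint_attention(regions, words)
```

In this sketch the first pass attends with a coarse guide built from averaged features, and the second pass attends again with a guide built from the already attended features, which is one plausible reading of moving from coarse-grained to fine-grained parts. In a full system the resulting joint vector would feed an answer classifier.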