Visual question answering algorithm based on image caption

Wenliang Cai, Guoyong Qiu
DOI: https://doi.org/10.1109/ITNEC.2019.8729467
2019-01-01
Abstract: A number of recent works have proposed image caption models for Visual Question Answering (VQA) that explain the answer. However, we think these models do not effectively combine the two important sources of textual information, the image caption and the question, and that they pay insufficient attention to the image and question information. In addition, we believe that richer text information can improve the accuracy of answer prediction. In this paper, we propose a VQA algorithm based on image captioning. The model uses a CNN, an LSTM, and a collaborative attention mechanism to generate an image caption related to the question, and then combines the two pieces of text, the image caption and the question, to predict an answer and output the caption. We compare our model with mainstream algorithms on MSCOCO-VQA and VQA-V2. The experimental results show that the proposed algorithm predicts answers more accurately.