Question-Driven Multiple Attention (DQMA) Model for Visual Question Answer

Jinmeng Wu, Lei Ma, Fulin Ge, Yanbin Hao, Pengcheng Shu
DOI: https://doi.org/10.1109/aicit55386.2022.9930294
2022-01-01
Abstract: Visual Question Answering (VQA) is a typical multimodal problem spanning computer vision and natural language processing; its goal is to accurately answer an open-ended question about an image. Existing visual question answering models inevitably introduce redundant and inaccurate visual information when exploring the rich interaction between complex image targets and text, and they fail to focus effectively on the targets in the scene. To address this problem, the Question-Driven Multiple Attention (QDMA) model is proposed. First, Faster R-CNN and an LSTM are used to extract visual features from images and textual features from questions, respectively. Then a question-driven attention network is designed to obtain the image regions of interest for the question, so that the model can accurately locate relevant targets in complex scenes. To establish dense interaction between the image regions of interest and the question words, a co-attention network consisting of self-attention and guided-attention units is introduced. Finally, the question features and image features are fed into an answer prediction module, a two-layer multi-layer perceptron, to obtain the answer. The proposed method is compared empirically with other methods on the VQA2.0 dataset. The results show that the model outperforms the other methods, demonstrating the effectiveness of the framework.
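The abstract does not give equations for the self-attention and guided-attention units, so the following is only a minimal sketch of how such units are commonly built, assuming standard scaled dot-product attention without learned projections. The function names, the feature dimension (64), the region count (36), and the word count (14) are illustrative assumptions, and random arrays stand in for the Faster R-CNN and LSTM features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_attention(queries, keys, values):
    # Assumed form: softmax(Q K^T / sqrt(d)) V, with no learned weights.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def self_attention(x):
    # Each element of x attends to every element of the same sequence.
    return scaled_dot_attention(x, x, x)


def guided_attention(x, y):
    # Elements of x (image regions) attend to elements of y (question words).
    return scaled_dot_attention(x, y, y)


# Hypothetical stand-ins for the extracted features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 64))  # image region features (assumed shape)
words = rng.normal(size=(14, 64))    # question word features (assumed shape)

words_sa, _ = self_attention(words)
regions_sa, _ = self_attention(regions)
regions_ga, ga_weights = guided_attention(regions_sa, words_sa)
```

Here each row of `ga_weights` is a distribution over question words for one image region, which is one way the question can steer which visual evidence is kept. A full model would add learned projections, multiple heads, and the two-layer MLP classifier on top of the fused features.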