Question-oriented Cross-Modal Co-Attention Networks for Visual Question Answering

Wei Guan, Zhenyu Wu, Wen Ping
DOI: https://doi.org/10.1109/iccece54139.2022.9712726
2022-01-01
Abstract: Existing cross-modal co-attention models for the visual question answering (VQA) task lack focus when processing text and image information. To address this problem, a question-oriented cross-modal co-attention network is proposed. The network consists of a multimodal feature extraction module, a question-oriented cross-modal co-attention module, a feature fusion module, and a classifier. The extracted image and question features each pass through stacked attention layers, which output attention-weighted features. These features are linearly fused and fed into a SoftMax classifier to predict the answer to the question. Finally, a counting module is incorporated to improve the model's counting ability. On the public VQA v2.0 dataset, the model achieves overall classification accuracies of 70.71% and 70.78% on the test-dev and test-std splits, respectively, showing some advantage over most advanced models.
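The pipeline described in the abstract (attention over image features guided by the question, linear fusion, SoftMax classification) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the mean-pooled question vector, the scaled dot-product scoring, the single attention layer, the random weights, and all names and dimensions (`question_guided_attention`, `predict`, `W_v`, `W_q`, `W_out`, 36 regions, 14 words) are assumptions chosen for illustration, and the counting module is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(img_feats, q_feats):
    """Weight image regions by their relevance to the question.

    img_feats: (R, d) region features; q_feats: (T, d) word features.
    Returns the attended image vector (d,), the pooled question
    vector (d,), and the attention weights over regions (R,).
    """
    d = img_feats.shape[1]
    q_vec = q_feats.mean(axis=0)              # assumed pooling: mean over words
    scores = img_feats @ q_vec / np.sqrt(d)   # assumed scoring: scaled dot product
    weights = softmax(scores)                 # one weight per image region
    v_att = weights @ img_feats               # attention-weighted image feature
    return v_att, q_vec, weights

def predict(img_feats, q_feats, W_v, W_q, W_out):
    # Linear fusion of the two modalities, then a SoftMax over candidate answers.
    v_att, q_vec, _ = question_guided_attention(img_feats, q_feats)
    fused = v_att @ W_v + q_vec @ W_q
    return softmax(fused @ W_out)

# Hypothetical sizes: 36 regions, 14 words, 64-d features, 10 candidate answers.
rng = np.random.default_rng(0)
R, T, d, h, A = 36, 14, 64, 128, 10
img = rng.normal(size=(R, d))
q = rng.normal(size=(T, d))
W_v = rng.normal(size=(d, h)) * 0.1
W_q = rng.normal(size=(d, h)) * 0.1
W_out = rng.normal(size=(h, A)) * 0.1

probs = predict(img, q, W_v, W_q, W_out)
_, _, attn = question_guided_attention(img, q)
print("predicted answer index:", int(probs.argmax()))
```

In this sketch the question decides which image regions matter, which is one simple reading of "question-oriented"; the paper's actual module stacks several attention layers and also produces attention-weighted question features.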
What problem does this paper attempt to address?