Ques-to-Visual Guided Visual Question Answering.

Xiangyu Wu, Jianfeng Lu, Zhuanfeng Li, Fengchao Xiong
DOI: https://doi.org/10.1109/icip46576.2022.9897277
2022-01-01
Abstract: Visual question answering (VQA) answers text-based questions about images. The difficulty of VQA lies in accurately localizing the image region related to the question. In this paper, we introduce the ques-to-visual (q2v) feature as an additional input to VQA to tackle this problem. The q2v feature is generated from the semantics of the question and contains visual semantics that help locate the region related to the question. We then use self-attention to model the intra-modality relationships and enhance each of the q2v, image, and text features. The enhanced features are then fused by spatial guided-attention and multi-scale channel attention modules for answer prediction. Experimental results on the VQA2.0 benchmark dataset show that our method achieves higher performance than the other methods compared.
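The abstract's two attention stages (self-attention within each modality, then attention that lets one modality guide another) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the spatial guided-attention or multi-scale channel attention modules, so the code below uses plain scaled dot-product attention with no learned projections, and the names `self_attention`, `guided_attention`, and the toy feature values are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-width vectors."""
    width = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(width).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(width) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def self_attention(features):
    """Intra-modality enhancement: each item attends to its own modality."""
    return attention(features, features, features)


def guided_attention(features, guide):
    """Cross-modality step (assumed form): `features` attend to `guide`."""
    return attention(features, guide, guide)


# Toy features (hypothetical values): 3 image regions, 2 q2v vectors, 2 text tokens.
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q2v = [[0.0, 1.0], [0.2, 0.8]]
text = [[0.5, 0.5], [1.0, 0.0]]

# Stage 1: enhance each modality independently.
image_enh = self_attention(image)
q2v_enh = self_attention(q2v)
text_enh = self_attention(text)

# Stage 2: let q2v and text guide the image regions, then concatenate per region.
by_q2v = guided_attention(image_enh, q2v_enh)
by_text = guided_attention(image_enh, text_enh)
fused = [a + b for a, b in zip(by_q2v, by_text)]  # 3 regions, width 4
```

A real model would add learned query, key, and value projections, multiple heads, and a classifier over the fused features. Concatenation is used here only as the simplest stand-in for the fusion modules named in the abstract.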