Intra-Modality Feature Interaction Using Self-attention for Visual Question Answering.

Huan Shao, Yunlong Xu, Yi Ji, Jianyu Yang, Chunping Liu
DOI: https://doi.org/10.1007/978-3-030-36802-9_24
2019-01-01
Abstract: Capturing the interactions between different modalities more effectively has recently become an active research topic in visual question answering (VQA). Inspired by human visual information processing, we propose a VQA method based on intra-modality feature interaction with a self-attention mechanism (IMFI-SA). We adopt object-level features obtained with bottom-up attention, instead of feature mapping, to extract fine-grained information from images. Moreover, the proposed IMFI-SA model extracts the intra-modality interactions within the question modality and within the image modality separately. Finally, we combine the enhanced object-level feature interactions, obtained using top-down cross-attention, with the question feature interactions to predict the answer for a given question and image. Experimental results on the VQA2.0 dataset show that the proposed method outperforms existing methods at reasoning out answers, especially on counting questions.
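To make the idea of intra-modality interaction concrete, the following is a minimal sketch of generic scaled dot-product self-attention applied to a set of object-level image features. It is not the IMFI-SA architecture itself, whose details the abstract does not give. The function names, the 36-object by 64-dimension feature shape, and the random projection matrices are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(features, w_q, w_k, w_v):
    """Generic scaled dot-product self-attention within one modality.

    features: (n, d) array, e.g. n object-level features of one image
              or n word features of one question (hypothetical shapes).
    w_q, w_k, w_v: (d, d_k) projection matrices (assumed, untrained here).
    Returns the enhanced features and the (n, n) attention weights.
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v
    # Every element attends to every other element of the same modality.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Illustrative input: 36 object features of dimension 64 (assumed sizes).
rng = np.random.default_rng(0)
num_objects, dim = 36, 64
objects = rng.standard_normal((num_objects, dim))
w_q, w_k, w_v = (
    rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
)

enhanced, weights = self_attention(objects, w_q, w_k, w_v)
```

Each row of `weights` describes how strongly one object attends to every object in the same image, which is one plausible way for relations among objects (useful, for example, in counting questions) to enter the enhanced features. The same function could be applied to question word features to model interactions within the question modality.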