Jointly Learning Attentions With Semantic Cross-Modal Correlation For Visual Question Answering

Liangfu Cao, Lianli Gao, Jingkuan Song, Xing Xu, Heng Tao Shen
DOI: https://doi.org/10.1007/978-3-319-68155-9_19
2017-01-01
Abstract: Visual Question Answering (VQA) has emerged as a prominent multidisciplinary research problem in artificial intelligence. A number of recent studies propose attention mechanisms such as visual attention ("where to look") or question attention ("what words to listen to"), and these have proved effective for VQA. However, they focus on modeling the prediction error and ignore the semantic correlation between image attention and question attention, which inevitably leads to suboptimal attentions. In this paper, we argue that in addition to modeling visual and question attentions, it is equally important to model their semantic correlation, both to learn them jointly and to facilitate their joint representation learning for VQA. We propose a novel end-to-end model that jointly learns attentions with semantic cross-modal correlation to solve the VQA problem efficiently. Specifically, we propose a multi-modal embedding that maps the visual and question attentions into a joint space to guarantee their semantic consistency. Experimental results on benchmark datasets demonstrate that our model outperforms several state-of-the-art techniques for VQA.
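The abstract does not give the form of the multi-modal embedding or of the correlation term, so the following is only an illustrative sketch of the general idea: attend over image regions and question words, project both attended vectors into a shared space, and add a penalty when they disagree. The linear projections, the cosine-distance penalty, the `weight` parameter, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Attention-weighted sum of feature vectors (regions or words)."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

def project(matrix, vec):
    """Assumed linear map into the joint space; matrix is joint_dim x in_dim."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def correlation_loss(visual, question):
    """Assumed penalty: cosine distance between the two joint-space vectors."""
    return 1.0 - cosine(visual, question)

def joint_loss(prediction_loss, visual, question, weight=1.0):
    """Prediction error plus a weighted cross-modal correlation penalty."""
    return prediction_loss + weight * correlation_loss(visual, question)

# Toy example: two image regions, two question words, 2-d features.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections

v = project(identity, attend(regions, [2.0, 0.0]))
q = project(identity, attend(words, [0.0, 1.0]))
total = joint_loss(prediction_loss=0.7, visual=v, question=q, weight=0.5)
```

In a trained model the projections and attention scores would be learned end to end; here they are fixed toy values so that the two loss terms can be inspected separately.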