Question-Led Object Attention for Visual Question Answering

Lianli Gao, Liangfu Cao, Xing Xu, Jie Shao, Jingkuan Song
DOI: https://doi.org/10.1016/j.neucom.2018.11.102
IF: 6
2020-01-01
Neurocomputing
Abstract: The question plays a leading role in Visual Question Answering (VQA) because it specifies the particular visual objects, or conjures up the vivid visual content, that the machine should attend to. However, existing approaches predominantly predict the answer from the question and the whole image without exploiting this leading role. In addition, recent object spatial inference is usually conducted at the pixel level rather than the object level. We therefore propose a novel but simple framework, Question-Led Object Attention (QLOB), to improve VQA performance by exploring question semantics, fine-grained object information, and the relationship between these two modalities. First, we extract sentence semantics with a question model and use an efficient object detection network to obtain a global visual feature and local features from the top r object region proposals. Second, our QLOB attention mechanism selects the question-related object regions. Third, we optimize the question model and QLOB attention with a softmax classifier to predict the final answer. Extensive experimental results on three public VQA datasets demonstrate that QLOB outperforms state-of-the-art methods.
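The second step of the abstract, selecting question-related object regions, can be illustrated with a generic soft-attention sketch. This is not the paper's formulation: the dot-product scoring, the function names `softmax` and `question_led_attention`, and the toy two-dimensional features are all assumptions made for illustration. The abstract only states that a question representation is used to weight features from the top r region proposals.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_led_attention(question_vec, region_feats):
    """Weight r region features by their relevance to the question.

    Assumption: relevance is a plain dot product between the question
    vector and each region feature; the paper may score differently.
    """
    # One relevance score per object region.
    scores = [
        sum(q * v for q, v in zip(question_vec, region))
        for region in region_feats
    ]
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Attended visual feature: weighted sum of the region features.
    dim = len(region_feats[0])
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, attended


# Toy example: r = 3 regions with 2-dimensional features (hypothetical values).
question_vec = [1.0, 0.0]
region_feats = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, attended = question_led_attention(question_vec, region_feats)
```

In this toy setup the first region aligns with the question vector, so it receives the largest weight. In a full model the attended feature would presumably be combined with the question representation and passed to the softmax answer classifier mentioned in the third step.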