Local Relation Network with Multilevel Attention for Visual Question Answering

Bo Sun, Zeng Yao, Yinghui Zhang, Lejun Yu
DOI: https://doi.org/10.1016/j.jvcir.2020.102762
IF: 2.887
2020-01-01
Journal of Visual Communication and Image Representation
Abstract: With the tremendous success of visual question answering (VQA), visual attention mechanisms have become an indispensable part of VQA models. However, existing attention-based methods do not consider the relationships among image regions, which are crucial for the model to understand the image thoroughly. We propose local relation networks (LRNs) that generate a context-aware feature for each image region, encoding its relationships with the other image regions. Furthermore, we propose a multilevel attention mechanism that combines semantic information from the LRNs with the original image regions, making the model's decisions more reasonable. With these two measures, we improve the region representation and achieve better attention and VQA performance. We conduct extensive experiments on the COCO-QA dataset and the largest benchmark dataset, VQA v2.0. Our model achieves competitive results, and visual demonstrations show the effectiveness of the proposed LRNs and multilevel attention mechanism. © 2020 Published by Elsevier Inc.