Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering
Ling Gao, Hongda Zhang, Nan Sheng, Lida Shi, Hao Xu
DOI: https://doi.org/10.1016/j.eswa.2023.122239
IF: 8.5
2023-10-27
Expert Systems with Applications
Abstract: Great strides have been made in the field of visual question answering (VQA), driven by the application and development of deep learning in related research fields. Existing models in this field focus on the learning and fusion of visual and textual features. However, it is crucial for VQA tasks to capture the associations between image regions and to use question information to enhance key features. In this paper, we propose a method for mining and integrating neighbor-enhanced region representations and question-guided visual representations. First, a region feature graph is constructed to integrate the features of all regions and the relationships between regions. Second, a random walk-based method is presented to acquire the neighbor-enhanced region representations, which incorporate the topological relationships of all region nodes in the graph. Third, a question-guided vertical and horizontal dual attention mechanism is proposed to enhance the region representations at the region level and the feature level, respectively. Finally, the enhanced region representations and the question representation are integrated adaptively to predict the answer. Experiments show that our method outperforms prior state-of-the-art methods on two competitive benchmarks, VQA v1 and VQA v2.
computer science, artificial intelligence,engineering, electrical & electronic,operations research & management science
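The pipeline outlined in the abstract (region graph, random walk, question-guided attention at the region and feature levels) can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the cosine-similarity graph, the random walk with restart, the `restart` and `steps` parameters, the softmax and sigmoid attention forms, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def region_graph(X):
    """Row-stochastic transition matrix built from cosine similarity
    between region features (an assumed graph construction)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    A = np.clip(Xn @ Xn.T, 0.0, None)  # keep non-negative similarities only
    np.fill_diagonal(A, 0.0)           # drop self-similarity
    A = A + 1e-12                      # guard against all-zero rows
    return A / A.sum(axis=1, keepdims=True)


def random_walk_with_restart(T, restart=0.3, steps=20):
    """Iterate P <- (1 - r) * P @ T + r * I, starting from P = I.
    Row i of P is the visiting distribution of a walker started at region i."""
    I = np.eye(T.shape[0])
    P = I.copy()
    for _ in range(steps):
        P = (1.0 - restart) * (P @ T) + restart * I
    return P


def neighbor_enhanced(X, restart=0.3, steps=20):
    """Mix each region's feature with those of regions the walk reaches."""
    P = random_walk_with_restart(region_graph(X), restart, steps)
    return P @ X


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def dual_attention(H, q):
    """Question-guided weighting at two levels (assumed forms).
    H: (n_regions, d) region representations, q: (d,) question vector."""
    region_w = softmax(H @ q)                                 # one weight per region
    feat_gate = 1.0 / (1.0 + np.exp(-(H.mean(axis=0) * q)))   # one gate per feature
    enhanced = region_w[:, None] * H * feat_gate[None, :]
    return enhanced, region_w, feat_gate


# Toy usage with random data: 5 regions, 8-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
q = rng.normal(size=8)
H = neighbor_enhanced(X)
enhanced, region_w, feat_gate = dual_attention(H, q)
```

In this sketch the region-level weights sum to one across regions, while the feature-level gate scales each feature dimension independently. The final adaptive fusion with the question representation is omitted because the abstract does not describe it.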