Latent Attention Network With Position Perception for Visual Question Answering

Jing Zhang,Xiaoqiang Liu,Zhe Wang
DOI: https://doi.org/10.1109/tnnls.2024.3377636
IF: 14.255
2024-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract:For exploring the complex relative position relationships among multiobject with multiple position prepositions in the question, we propose a novel latent attention (LA) network for visual question answering (VQA), in which LA with position perception is extracted by a novel LA generation module (LAGM) and encoded along with absolute and relative position relations by our proposed position-aware module (PAM). The LAGM reconstructs original attention into LA by capturing the tendency of visual attention shifting according to the position prepositions in the question. The LA accurately captures the complex relative position features of multiple objects and helps the model locate the attention to the correct object or region. The PAM adopts latent state and relative position relations to enhance the capability of comprehending the multiobject correlations. In addition, we also propose a novel gated counting module (GCM) to strengthen the sensitivity of quantitative knowledge for effectively improving the performance of counting questions. Extensive experiments demonstrate that our proposed method achieves excellent performance on VQA and outperforms state-of-the-art methods on the widely used datasets VQA v2 and VQA v1.
computer science, artificial intelligence, theory & methods,engineering, electrical & electronic, hardware & architecture
What problem does this paper attempt to address?