A focus fusion attention mechanism integrated with image captions for knowledge graph-based visual question answering

Mingyang Ma, Turdi Tohti, Yi Liang, Zicheng Zuo, Askar Hamdulla
DOI: https://doi.org/10.1007/s11760-024-03013-7
IF: 1.583
2024-02-14
Signal Image and Video Processing
Abstract: Knowledge graph-based visual question answering integrates the rich information in a knowledge graph to handle complex questions that cannot be answered from image features alone, while also improving performance on fundamental visual question answering tasks. The core of the task is to achieve effective cross-modal information fusion and to bridge the semantic gap between images and text, so that answers can be predicted more accurately. However, current visual question answering methods face challenges such as sparse information, reliance on a single type of fused feature, and excessive computational burden. The image regions relevant to a question are sparse, yet traditional fusion methods such as linear pooling and cross-attention, while capable of handling interactions between modalities, match the question against the entire image globally. This introduces unnecessary noise and increases computational complexity. To solve these problems, we propose a focus fusion attention mechanism (FFAM) integrated with image captions, which reduces noise and computational burden by focusing on the top-k high-relevance areas. In addition, we adopt the BLIP-2 model to generate image captions and introduce them as a new modality in the fusion process, removing the limitation of relying solely on features generated by the image encoder. Although introducing the knowledge graph can increase processing complexity and noise, our method remains effective. On the F-VQA dataset, our model improves by 2.57% over the baseline model without the knowledge graph and achieves an accuracy of 86.35% with the knowledge graph.
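The abstract's idea of attending only to the top-k most relevant regions can be illustrated with a small sketch. This is not the authors' FFAM implementation: the function name `topk_cross_attention`, the scaled dot-product scoring, and the use of NumPy are all assumptions made for illustration. Each question token keeps only its k highest-scoring image regions and renormalizes the attention weights over them.

```python
import numpy as np

def topk_cross_attention(question, regions, k):
    """Hypothetical top-k cross-attention sketch (not the paper's FFAM).

    question: (n_q, d) question token features
    regions:  (n_r, d) image region features
    k:        number of regions each question token may attend to
    """
    d = question.shape[-1]
    # Scaled dot-product relevance between every token and every region.
    scores = question @ regions.T / np.sqrt(d)            # (n_q, n_r)
    k = min(k, scores.shape[-1])
    # Column indices of the k largest scores in each row (unordered).
    topk_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    # Mask everything outside the top-k with -inf so softmax gives it weight 0.
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, topk_idx, 0.0, axis=-1)
    masked = scores + mask
    # Numerically stable softmax over the surviving entries.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each token's output is a weighted sum of its k selected regions.
    return weights @ regions, weights
```

Note that this sketch still computes the full score matrix before masking, so it shows only the noise-filtering side of top-k selection; any computational savings would depend on how the selection is implemented, which the abstract does not specify.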
engineering, electrical & electronic; imaging science & photographic technology