Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance.

Jianfeng Wang, Anda Zhang, Huifang Du, Haofen Wang, Wenqiang Zhang
DOI: https://doi.org/10.1145/3579051.3579073
2022-01-01
Abstract: Visual Question Answering (VQA) has practical social value, but it requires complex joint reasoning over vision and language together with external knowledge. Recently, Knowledge-Based VQA has attracted the attention of researchers. External knowledge comes from many sources, including visual, textual, and commonsense knowledge, and can effectively improve the reasoning ability of a VQA model. However, introducing different knowledge sources increases the probability of retrieving irrelevant facts, which adds noise and degrades the model's performance. Existing approaches address this noise with contrastive learning, prompt learning, visual matrices, dense retrieval, and similar techniques, but they introduce complex processing pipelines. Furthermore, the knowledge representation in these approaches is limited to specific knowledge forms, such as knowledge-graph triples. To address these challenges, we propose a multi-modal joint-guided (MMJG) method for introducing external knowledge. The method uses attention over multi-modal information to select the external knowledge most relevant to the current question. Unlike existing methods, our approach learns an adaptive selection module to select external knowledge that is more relevant to the question, and it is not tied to a particular knowledge form. Comparison and ablation experiments on the benchmark dataset show that our method achieves better results, demonstrating its effectiveness.
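The abstract describes selecting relevant external knowledge through attention over multi-modal information. The sketch below illustrates that general idea only and is not the paper's MMJG module. It rests on several assumptions that do not come from the source: the question, the image, and each knowledge candidate are already embedded as equal-length vectors, the joint query is a plain average of the question and image vectors, and selection is a fixed top-k cut rather than the learned adaptive selection the abstract mentions. The names `select_knowledge`, `candidates`, and `top_k` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_knowledge(question_vec, image_vec, candidates, top_k=2):
    """Rank knowledge candidates by attention weight against a joint query.

    candidates is a list of (text, embedding) pairs. Because only an
    embedding is required, the candidate text can take any form
    (a sentence, a serialized triple, a caption).
    """
    # Assumption: fuse the two modalities by elementwise averaging.
    query = [(q + v) / 2.0 for q, v in zip(question_vec, image_vec)]
    # Scaled dot-product scores, as in standard attention.
    scale = math.sqrt(len(query))
    scores = [dot(query, vec) / scale for _, vec in candidates]
    weights = softmax(scores)
    # Assumption: keep a fixed top-k instead of a learned selector.
    ranked = sorted(zip(candidates, weights), key=lambda p: p[1], reverse=True)
    return [(text, w) for (text, _), w in ranked[:top_k]]


# Toy embeddings, made up for illustration only.
candidates = [
    ("a banana is a fruit", [0.9, 0.1, 0.0]),
    ("cars have four wheels", [0.0, 0.2, 0.9]),
    ("monkeys eat bananas", [0.8, 0.3, 0.1]),
]
selected = select_knowledge([1.0, 0.0, 0.0], [0.8, 0.2, 0.0], candidates, top_k=2)
for text, weight in selected:
    print(f"{weight:.3f}  {text}")
```

With these toy vectors, the unrelated fact about cars receives the lowest attention weight and is dropped, which is the noise-filtering behavior the abstract motivates.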