A Knowledge-Enhanced Inferential Network for Cross-Modality Multi-hop VQA

Shiqi Wang, Jianxing Yu, Miaopei Lin, Shuang Qiu, Xi Luo, Jian Yin
DOI: https://doi.org/10.1007/978-3-031-46664-9_41
2023-01-01
Abstract: This paper focuses on cross-modality multi-hop visual questions, which require multi-hop reasoning over knowledge from multiple modalities, such as images and text. Traditional models lack cross-modality reasoning ability and therefore struggle to make correct predictions. To solve this problem, we propose a new knowledge-enhanced inferential framework. We first build a reasoning graph that captures the topological relations between the objects in the given image and the logical relations among the entities corresponding to these objects. To align the visual objects with the textual entities, we design a cross-modality retriever that draws on an external multimodal knowledge graph. Based on the logical and topological relations in the graph, we derive the answer by decomposing a complex multi-hop question into a series of attention-based reasoning steps, where the result of each step serves as the context for the next. Linking the results of all steps forms an evidence chain that leads to the answer. Extensive experiments on the popular KVQA dataset demonstrate the effectiveness of our approach.
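The step-by-step reasoning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `multi_hop_attention`, the dot-product scoring, the way attention is propagated along edges, and all input shapes are assumptions made only to show how the result of one attention step can act as the context of the next and how the per-step results form an evidence chain.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def multi_hop_attention(node_feats, adjacency, step_queries):
    """Run one attention step per sub-question, carrying context forward.

    All arguments are hypothetical stand-ins, not the paper's actual inputs:
      node_feats:   (n, d) array, one feature vector per reasoning-graph node.
      adjacency:    (n, n) 0/1 array, adjacency[i, j] = 1 for an edge i -> j.
      step_queries: list of (d,) vectors, one per reasoning hop.
    Returns the node index selected at each hop (the evidence chain).
    """
    n = node_feats.shape[0]
    context = np.ones(n) / n  # before the first hop, every node is equally likely
    chain = []
    for q in step_queries:
        # Push the previous step's attention along graph edges; the added
        # identity keeps the currently attended node reachable as well.
        reachable = (adjacency + np.eye(n)).T @ context
        # Score each node against this hop's query, then gate by reachability
        # so that the previous result constrains the current step.
        attn = softmax(node_feats @ q) * reachable
        attn = attn / attn.sum()
        chain.append(int(np.argmax(attn)))
        context = attn  # the result of this step is the context of the next
    return chain
```

In this sketch, a node that matches the current query but cannot be reached from the previously attended nodes receives little weight, which is one simple way to tie each hop to the one before it.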