Learning to Supervise Knowledge Retrieval over a Tree Structure for Visual Question Answering
Ning Xu, Zimu Lu, Hongshuo Tian, Rongbao Kang, Jinbo Cao, Yongdong Zhang, An-An Liu
DOI: https://doi.org/10.1109/tmm.2024.3355638
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Knowledge-based visual question answering (KB-VQA) aims to retrieve external knowledge beyond the image content to answer questions. However, current methods tend to introduce irrelevant knowledge because of two drawbacks: (1) Synonymy issue. Existing methods rely heavily on words from questions or object labels in images to match knowledge from databases, which disregards the fact that the same word may hold different meanings in different contexts. (2) Knowledge uncertainty issue. Lacking supervisory signals, recent methods cannot determine which knowledge is applicable for answer inference, and may therefore admit useless knowledge. To address these two problems, we propose to supervise the process of knowledge retrieval over a tree structure for the KB-VQA task. For the synonymy issue, we construct a hierarchical knowledge tree to capture the subordination information between knowledge facts, mitigating the impact of synonyms on knowledge retrieval. For the knowledge uncertainty issue, we use the retrieval history as the ground truth to supervise the knowledge retrieval, which helps the QA model form an explicit path of knowledge facts for answer understanding. Finally, we integrate the image, question, and retrieved knowledge into a variant of the transformer to predict answers. Experimental results on the KR-VQA, OK-VQA, and VQA v2 datasets validate the effectiveness of the proposed method.
computer science, information systems, telecommunications, software engineering
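
The retrieval idea described in the abstract (descending a hierarchical knowledge tree and supervising each step with a recorded path) can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` class, the token-overlap `score` function, the example facts, and the per-level cross-entropy in `path_loss` are all hypothetical stand-ins chosen only to make the idea concrete.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One knowledge fact in a toy hierarchical knowledge tree (hypothetical)."""
    fact: str
    children: list["Node"] = field(default_factory=list)


def score(query: str, fact: str) -> float:
    """Jaccard token overlap; an assumed stand-in for a learned relevance score."""
    q, f = set(query.lower().split()), set(fact.lower().split())
    return len(q & f) / (len(q | f) or 1)


def retrieve_path(root: Node, query: str) -> list[str]:
    """Greedy top-down retrieval: at each level, follow the best-scoring child."""
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda c: score(query, c.fact))
        path.append(node.fact)
    return path


def path_loss(root: Node, query: str, gold_path: list[str]) -> float:
    """Sum of per-level softmax cross-entropy terms against a supervised path."""
    loss, node = 0.0, root
    for gold in gold_path:
        scores = [score(query, c.fact) for c in node.children]
        idx = [c.fact for c in node.children].index(gold)
        log_z = math.log(sum(math.exp(s) for s in scores))
        loss += log_z - scores[idx]  # -log softmax(scores)[idx]
        node = node.children[idx]
    return loss


# Toy tree: the word "bat" appears under two different parent concepts.
root = Node("root", [
    Node("animal", [
        Node("dog is a pet animal"),
        Node("bat is a flying animal"),
    ]),
    Node("sports equipment", [
        Node("bat is used to hit a ball in baseball"),
    ]),
])

query = "what sports equipment is the player holding to hit the ball"
gold = ["sports equipment", "bat is used to hit a ball in baseball"]
wrong = ["animal", "bat is a flying animal"]

print(retrieve_path(root, query))
print(path_loss(root, query, gold), path_loss(root, query, wrong))
```

In this sketch the parent level disambiguates the two "bat" facts before any leaf is compared, and the recorded path plays the role of the supervisory signal: a path that matches the query context receives a lower loss than one that does not.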