Multi-Modal Validation and Domain Interaction Learning for Knowledge-based Visual Question Answering

Ning Xu, Yifei Gao, An-An Liu, Hongshuo Tian, Yongdong Zhang
DOI: https://doi.org/10.1109/tkde.2024.3384270
2024-01-01
Abstract: Knowledge-based Visual Question Answering (KB-VQA) aims to answer image-related questions using external knowledge, which requires an agent not only to understand images but also to explicitly retrieve and integrate knowledge facts. Intuitively, to answer a question accurately, humans validate the retrieved knowledge against their memory and then align the knowledge facts with image regions to infer the answer. However, most existing methods ignore this process of knowledge validation and alignment. In this paper, we propose the Multi-Modal Validation and Domain Interaction Learning method, which consists of two components: 1) Multi-modal validation for knowledge retrieval. We propose the multi-modal validation module (MMV) to evaluate the confidence of each retrieved knowledge fact using the image and the question, which preserves the knowledge candidates that are effective for inferring answers. 2) Domain interaction for knowledge integration. We propose the Domain Interaction TRansformer module (DI-TR) to align visual regions with knowledge facts through interaction learning in an improved transformer. Specifically, inter-domain and intra-domain masks are injected into each self-attention layer to control the integration scope. The proposed method outperforms several strong baselines on three widely used knowledge-based datasets: KRVQA, OK-VQA and VQA2.0. Extensive experiments and ablation studies demonstrate the effectiveness of multi-modal knowledge validation and domain interaction learning.
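The two components described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how MMV scores facts or how the DI-TR masks are laid out, so the cosine-similarity scoring, the top-k cutoff, the token ordering (visual regions first, then facts), the absence of learned projections, and all function names below are assumptions made for illustration only.

```python
import numpy as np


def validate_facts(fact_emb, image_emb, question_emb, keep=2):
    """Hypothetical stand-in for MMV: score each fact against the image and
    the question with cosine similarity and keep the `keep` best facts."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

    # Assumption: confidence is the mean of the two similarities.
    conf = 0.5 * (cos(fact_emb, image_emb) + cos(fact_emb, question_emb))
    kept = np.sort(np.argsort(-conf)[:keep])
    return kept, conf


def domain_masks(n_visual, n_facts):
    """Boolean attention masks over a sequence of visual tokens followed by
    fact tokens. True means 'query row may attend to key column'."""
    assert n_visual > 0 and n_facts > 0, "both domains must be non-empty"
    domain = np.array([0] * n_visual + [1] * n_facts)
    same = domain[:, None] == domain[None, :]
    intra = same        # attend only within the token's own domain
    inter = ~same       # attend only to the other domain
    return intra, inter


def masked_self_attention(x, mask):
    """Single-head self-attention restricted by `mask`.
    Assumption: queries, keys and values are all `x` (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)          # block masked pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(5, 4))                  # 2 regions + 3 facts
    intra, inter = domain_masks(n_visual=2, n_facts=3)
    _, w_intra = masked_self_attention(tokens, intra)
    _, w_inter = masked_self_attention(tokens, inter)
    print(w_intra.round(2))
    print(w_inter.round(2))
```

In this sketch, switching between the `intra` and `inter` masks is what controls the integration scope: with `intra`, regions and facts are each refined within their own domain, while with `inter`, every region aggregates only from facts and every fact only from regions. How the two masks are combined or scheduled across layers is left open here because the abstract does not say.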