Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
Junnan Dong, Qinggang Zhang, Huachi Zhou, Daochen Zha, Pai Zheng, Xiao Huang
2024-02-20
Abstract: Knowledge-based visual question answering (KVQA) has been extensively studied
to answer visual questions with external knowledge, e.g., knowledge graphs
(KGs). While several attempts have been proposed to leverage large language
models (LLMs) as an implicit knowledge source, it remains challenging since
LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g.,
images, KGs and LLMs, cannot be readily aligned for complex scenarios. To
tackle these, we present a novel modality-aware integration with LLMs for KVQA
(MAIL). It carefully leverages multimodal knowledge for both image
understanding and knowledge reasoning. Specifically, (i) we propose a two-stage
prompting strategy with LLMs to densely embody the image into a scene graph
with detailed visual features; (ii) we construct a coupled concept graph by
linking the mentioned entities with external facts; (iii) a tailored
pseudo-siamese graph medium fusion is designed for sufficient multimodal
fusion. We utilize the shared mentioned entities in two graphs as mediums to
bridge a tight inter-modal exchange, while maximally preserving insightful
intra-modal learning by constraining the fusion within mediums. Extensive
experiments on two benchmark datasets show the superiority of MAIL with 24x
fewer resources.
Computation and Language, Information Retrieval, Computer Vision and Pattern Recognition, Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to use the knowledge in large language models (LLMs) effectively to improve image understanding and question reasoning in knowledge-based visual question answering (KVQA). Existing methods face two challenges when using LLMs:
1. **LLMs May Generate Hallucinations**: Querying LLMs directly can yield inaccurate answers or unreliable supporting evidence, especially for complex or domain-specific questions.
2. **Difficulty in Integrating Multimodal Knowledge**: Existing methods typically just concatenate information from different sources (images, knowledge graphs, and LLMs) before reasoning. This leaves little cross-modal communication and limits final reasoning performance.
To address these problems, the paper proposes a modality-aware integration framework with LLMs for KVQA (MAIL). The framework has three components:
1. **Two-stage Prompting Strategy**: First, prompt a visual LLM to produce a detailed description of the image that captures rich visual features; then extract the entities and their relationships from that description to form a scene graph.
2. **Coupled Concept Graph Construction**: Link the entities mentioned in the scene graph with facts from an external knowledge graph (such as ConceptNet) to form a coupled concept graph that supports knowledge reasoning.
3. **Pseudo-Siamese Graph Medium Fusion**: Design a pseudo-siamese graph medium fusion algorithm (PS-GMF). It uses the entities shared by the two graphs as mediums for inter-modal exchange, and confines fusion to those mediums so that intra-modal learning within each graph is preserved.
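The three components above can be illustrated with a small toy sketch. This is not the authors' implementation: the triples, the miniature ConceptNet-style fact list, the function names, and the `alpha` mixing weight are all hypothetical. The learned graph networks of PS-GMF are replaced here by a plain weighted average, and the LLM prompting stage is assumed to have already produced the scene-graph triples. The sketch only shows the data flow: scene graph, coupled concept graph, shared entities as mediums, and an exchange restricted to those mediums.

```python
# Toy illustration only: data, names, and the averaging rule are assumptions,
# not taken from the MAIL paper or its code.

# Assumed output of the two-stage prompting: (head, relation, tail) triples.
scene_graph = [
    ("man", "holding", "umbrella"),
    ("umbrella", "has color", "red"),
    ("man", "standing on", "street"),
]

# Hypothetical external facts written in a ConceptNet-like form.
external_kg = [
    ("umbrella", "UsedFor", "rain protection"),
    ("street", "AtLocation", "city"),
    ("rain protection", "RelatedTo", "weather"),
    ("piano", "UsedFor", "music"),
]

def entities(triples):
    """Collect every head and tail entity mentioned in a list of triples."""
    return {e for h, _, t in triples for e in (h, t)}

def build_concept_graph(scene_triples, kg_triples):
    """Keep the external facts whose head entity is mentioned in the scene graph."""
    mentioned = entities(scene_triples)
    return [(h, r, t) for h, r, t in kg_triples if h in mentioned]

def find_mediums(scene_triples, concept_triples):
    """Entities that appear in both graphs act as mediums."""
    return entities(scene_triples) & entities(concept_triples)

def fuse_at_mediums(scene_emb, concept_emb, mediums, alpha=0.5):
    """Exchange information only at medium nodes; all other nodes are untouched.
    A weighted average stands in for the learned fusion."""
    fused_scene, fused_concept = dict(scene_emb), dict(concept_emb)
    for m in mediums:
        s, c = scene_emb[m], concept_emb[m]
        fused_scene[m] = [(1 - alpha) * a + alpha * b for a, b in zip(s, c)]
        fused_concept[m] = [(1 - alpha) * b + alpha * a for a, b in zip(s, c)]
    return fused_scene, fused_concept

concept_graph = build_concept_graph(scene_graph, external_kg)
mediums = find_mediums(scene_graph, concept_graph)

# Placeholder 2-d embeddings so the two modalities are easy to tell apart.
scene_emb = {e: [1.0, 0.0] for e in entities(scene_graph)}
concept_emb = {e: [0.0, 1.0] for e in entities(concept_graph)}
fused_scene, fused_concept = fuse_at_mediums(scene_emb, concept_emb, mediums)

print(sorted(mediums))
print(fused_scene["umbrella"], fused_scene["man"])
```

In this toy setup only `umbrella` and `street` appear in both graphs, so only their embeddings change. Nodes such as `man` (scene graph only) and `city` (concept graph only) keep their original values, which mirrors the idea of constraining fusion to the mediums.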
Together, these components let MAIL draw on LLM knowledge more reliably and improve accuracy and reasoning on the KVQA task. In experiments on two benchmark datasets, MAIL outperforms existing baselines while using 24x fewer resources.