Abstract: Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been proposed to leverage large language models (LLMs) as an implicit knowledge source, it remains challenging since LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs and LLMs, cannot be readily aligned for complex scenarios. To tackle these, we present a novel modality-aware integration with LLMs for KVQA (MAIL). It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; (iii) a tailored pseudo-siamese graph medium fusion is designed for sufficient multimodal fusion. We utilize the shared mentioned entities in two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24x less resources.

What problem does this paper attempt to address?

This paper addresses how to use large language models (LLMs) effectively to improve image understanding and question reasoning in knowledge-based visual question answering (KVQA). Existing methods face two challenges when using LLMs:

1. **LLMs may generate hallucinations**: Querying an LLM directly can yield inaccurate answers or unreliable reasoning evidence, especially for complex or domain-specific questions.
2. **Multimodal knowledge is difficult to integrate**: Existing methods usually just concatenate information from different sources (images, knowledge graphs, and LLMs) before reasoning. This leaves little cross-modal communication and limits the final reasoning performance.

To address these problems, the paper proposes a modality-aware integration framework with LLMs for KVQA (MAIL). The framework has three components:

1. **Two-stage prompting strategy**: First, prompt a visual LLM to generate a detailed description of the image with rich visual features; then extract the entities in the scene and their relationships to form a scene graph.
2. **Coupled concept graph construction**: Link the entities mentioned in the scene graph with facts from an external knowledge graph (such as ConceptNet) to form a coupled concept graph that supports knowledge reasoning.
3. **Pseudo-siamese graph medium fusion (PS-GMF)**: Use the entities shared by the two graphs as mediums for inter-modal exchange. Constraining the fusion to these mediums achieves sufficient multimodal fusion while preserving as much intra-modal information as possible.

Together, these components let MAIL use LLM knowledge more reliably and improve both the accuracy and the reasoning ability of KVQA systems.

Experimental results show that MAIL outperforms multiple existing baseline models on two benchmark datasets while, according to the abstract, using 24x less resources.
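
To make the graph-related steps more concrete, the following is a minimal Python sketch of linking scene-graph entities to external facts and fusing information only at the shared entities (the "mediums"). It is an illustration under stated assumptions, not the paper's implementation: the triples, the function names (`build_concept_graph`, `find_mediums`, `fuse_at_mediums`), the toy embeddings, and the simple weighted average used for fusion are all hypothetical stand-ins. The actual PS-GMF uses learned graph networks.

```python
# Hypothetical scene-graph triples (head, relation, tail), of the kind that
# could be extracted from an image description in the two-stage prompting step.
scene_graph = [
    ("man", "holding", "umbrella"),
    ("umbrella", "has color", "red"),
    ("man", "standing on", "street"),
]

# Hypothetical external facts, written in the style of ConceptNet triples.
knowledge_graph = [
    ("umbrella", "UsedFor", "rain protection"),
    ("umbrella", "IsA", "accessory"),
    ("street", "AtLocation", "city"),
    ("boat", "AtLocation", "water"),
]


def entities(triples):
    """Collect every head and tail entity mentioned in a list of triples."""
    return {e for h, _, t in triples for e in (h, t)}


def build_concept_graph(scene_triples, kg_triples):
    """Keep only the external facts that touch an entity from the scene graph."""
    mentioned = entities(scene_triples)
    return [(h, r, t) for h, r, t in kg_triples if h in mentioned or t in mentioned]


def find_mediums(scene_triples, concept_triples):
    """Entities present in both graphs; these bridge the two modalities."""
    return entities(scene_triples) & entities(concept_triples)


def fuse_at_mediums(scene_emb, concept_emb, mediums, alpha=0.5):
    """Mix the two embeddings of each medium entity; leave other nodes untouched.

    A plain weighted average is an assumption made for illustration only.
    """
    fused_scene, fused_concept = dict(scene_emb), dict(concept_emb)
    for m in mediums:
        mixed = [alpha * s + (1 - alpha) * c
                 for s, c in zip(scene_emb[m], concept_emb[m])]
        fused_scene[m] = mixed
        fused_concept[m] = mixed
    return fused_scene, fused_concept


concept_graph = build_concept_graph(scene_graph, knowledge_graph)
mediums = find_mediums(scene_graph, concept_graph)

# Toy 2-dimensional node embeddings for each graph (made-up values).
scene_emb = {"man": [1.0, 0.0], "umbrella": [1.0, 1.0],
             "red": [0.0, 1.0], "street": [0.0, 2.0]}
concept_emb = {"umbrella": [3.0, 1.0], "rain protection": [1.0, 1.0],
               "accessory": [0.0, 0.0], "street": [2.0, 0.0], "city": [1.0, 2.0]}

fused_scene, fused_concept = fuse_at_mediums(scene_emb, concept_emb, mediums)
```

In this sketch only `umbrella` and `street` exchange information across the two graphs. Nodes such as `man` or `city` keep their original embeddings, which mirrors the idea of constraining inter-modal exchange to the mediums so that intra-modal information is preserved.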

Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
