Abstract: Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions. Current methods first retrieve knowledge from the image and an external knowledge base using the original complex question, then generate answers with Large Language Models (LLMs). However, since the original question contains complex elements that require knowledge from different sources, acquiring different kinds of knowledge in a coupled manner may confuse models and hinder them from retrieving precise knowledge. Furthermore, the "forward-only" answering process fails to explicitly capture the knowledge needs of LLMs, which can further hurt answering quality. To cope with these limitations, we propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion and uses LLM feedback to specify the required knowledge. Specifically, DKA requires LLMs to specify what knowledge they need to answer the question and to decompose the original complex question into two simple sub-questions: an image-based sub-question and a knowledge-based sub-question. We then use the two sub-questions to retrieve knowledge from the image and the knowledge base, respectively. In this way, the two knowledge acquisition models can each focus on the content that corresponds to them and avoid interference from irrelevant elements in the original complex question, which helps to provide more precise knowledge and to better align with the knowledge needs of LLMs to yield correct answers. Experiments on benchmark datasets show that DKA significantly outperforms SOTA models. To facilitate future research, our data and code are available at https://github.com/Lackel/DKA.

What problem does this paper attempt to address?

This paper addresses how to acquire and use knowledge more precisely in order to improve answer quality in the Knowledge-based Visual Question Answering (KVQA) task. Current methods use the original complex question to retrieve knowledge from both the image and an external knowledge base. Because that single question mixes elements that need different knowledge sources, this coupled retrieval can confuse the models and prevent them from retrieving the knowledge actually required. Moreover, existing methods follow a forward-only process that never asks the large language model (LLM) what knowledge it needs, which may further degrade answer quality. To address these issues, the paper proposes DKA (Disentangled Knowledge Acquisition from LLM feedback), a training-free framework that decouples the knowledge acquisition process and uses feedback from the LLM to specify the required knowledge.
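
The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `SubQuestions`, `dka_answer`, `decompose`, `ask_image`, `ask_kb`, and `generate` are all hypothetical, the prompt layout is invented for the example, and the sketch assumes the two sub-questions can be answered independently of each other. Each callable stands in for a model the abstract mentions only in general terms (an LLM that decomposes the question, an image knowledge model, a knowledge-base retriever, and an answering LLM).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubQuestions:
    """Hypothetical container for the two sub-questions produced by the LLM."""
    image_based: str      # meant to be answerable from the image alone
    knowledge_based: str  # meant to be answerable from an external knowledge base


def dka_answer(
    question: str,
    image: object,
    decompose: Callable[[str], SubQuestions],
    ask_image: Callable[[object, str], str],
    ask_kb: Callable[[str], str],
    generate: Callable[[str], str],
) -> str:
    """Sketch of disentangled knowledge acquisition (all callables are assumed)."""
    # Step 1: LLM feedback. The LLM states what it needs by splitting the
    # original question into an image-based and a knowledge-based sub-question.
    subs = decompose(question)

    # Step 2: disentangled acquisition. Each source only sees its own
    # sub-question, never the original complex question.
    image_knowledge = ask_image(image, subs.image_based)
    external_knowledge = ask_kb(subs.knowledge_based)

    # Step 3: the answering LLM receives both pieces of knowledge together
    # with the original question. The prompt format here is an assumption.
    prompt = (
        f"Image knowledge: {image_knowledge}\n"
        f"External knowledge: {external_knowledge}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return generate(prompt)
```

Passing the models in as plain callables keeps the sketch training-free in the same sense as the framework: nothing is fine-tuned, and any off-the-shelf captioner, retriever, or LLM wrapper could be plugged in.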

Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models

A Framework of Knowledge Graph-Enhanced Large Language Model Based on Question Decomposition and Atomic Retrieval

LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection

Enhancing Large Language Models with Knowledge Graphs for Robust Question Answering

Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

Retrieve-Rewrite-Answer: A KG-to-Text Enhanced LLMs Framework for Knowledge Graph Question Answering

Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models

Knowledge Condensation and Reasoning for Knowledge-based VQA

Boosting Visual Question Answering with Context-aware Knowledge Aggregation

Retrieval and Reasoning on KGs: Integrate Knowledge Graphs into Large Language Models for Complex Question Answering

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

Enhancing Large Language Models with Pseudo- and Multisource- Knowledge Graphs for Open-ended Question Answering

LB-KBQA: Large-language-model and BERT based Knowledge-Based Question and Answering System

Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets

KG-EGV: A Framework for Question Answering with Integrated Knowledge Graphs and Large Language Models

Multi-Modal Validation and Domain Interaction Learning for Knowledge-based Visual Question Answering