Abstract: Knowledge-based Visual Question Answering (KVQA) requires both image content and world knowledge to answer questions. Current methods first retrieve knowledge from the image and an external knowledge base using the original complex question, then generate answers with Large Language Models (LLMs). However, because the original question contains complex elements that require knowledge from different sources, acquiring different kinds of knowledge in a coupled manner may confuse models and hinder them from retrieving precise knowledge. Furthermore, the "forward-only" answering process fails to explicitly capture the knowledge needs of LLMs, which can further hurt answer quality. To address these limitations, we propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion and uses the LLM's feedback to specify the required knowledge. Specifically, DKA asks the LLM to specify what knowledge it needs to answer the question and to decompose the original complex question into two simple sub-questions: an image-based sub-question and a knowledge-based sub-question. We then use the two sub-questions to retrieve knowledge from the image and the knowledge base, respectively. In this way, the two knowledge acquisition models can each focus on the content relevant to them, undisturbed by irrelevant elements of the original complex question, which helps provide more precise knowledge that better aligns with the knowledge needs of LLMs and yields correct answers. Experiments on benchmark datasets show that DKA significantly outperforms SOTA models. To facilitate future research, our data and code are available at https://github.com/Lackel/DKA.
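
The disentangled flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function `dka_answer`, the `SubQuestions` container, and the four callables (`decompose`, `ask_image`, `search_kb`, `generate`) are hypothetical names introduced here, standing in for the LLM decomposition step, the image-side knowledge model, the knowledge-base retriever, and the final LLM answering step.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SubQuestions:
    """Hypothetical container for the two decomposed sub-questions."""
    image_based: str       # answerable from the image alone
    knowledge_based: str   # answerable from the external knowledge base


def dka_answer(
    image: Any,
    question: str,
    decompose: Callable[[str], SubQuestions],
    ask_image: Callable[[Any, str], str],
    search_kb: Callable[[str], str],
    generate: Callable[[str, str, str], str],
) -> str:
    """Sketch of disentangled knowledge acquisition (all callables are assumed)."""
    # 1. LLM feedback: split the complex question into two simple sub-questions.
    subs = decompose(question)
    # 2. Each knowledge source only sees the sub-question that concerns it.
    visual_knowledge = ask_image(image, subs.image_based)
    external_knowledge = search_kb(subs.knowledge_based)
    # 3. The LLM answers the original question using both pieces of knowledge.
    return generate(question, visual_knowledge, external_knowledge)
```

The point of the sketch is only the routing: the original question reaches the decomposition and answer steps, while the image model and the knowledge-base retriever each receive a single sub-question.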

Knowledge Generation for Zero-shot Knowledge-based VQA

Zero-Shot Visual Question Answering Using Knowledge Graph

Benchmarking Knowledge-driven Zero-shot Learning

LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection

K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition

Zero-shot Visual Question Answering with Language Model Feedback

Zero-shot and Few-shot Learning with Knowledge Graphs: A Comprehensive Survey

Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

Knowledge-aware Zero-Shot Learning: Survey and Perspective

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Knowledge Condensation and Reasoning for Knowledge-based VQA

K-ZSL: Resources for Knowledge-driven Zero-shot Learning

ZVQAF: Zero-shot visual question answering with feedback from large language models

Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models

Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering

Zero-Resource Knowledge-Grounded Dialogue Generation

Good Questions Help Zero-Shot Image Reasoning

Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering