Abstract: Recently, to improve Vision Language Models (VLMs) for Visual Question Answering (VQA), several methods have been proposed that strengthen the inference capabilities of VLMs so that they can tackle VQA tasks independently, rather than serving only as aids to Large Language Models (LLMs). However, these methods ignore the rich common-sense knowledge contained in the given VQA image, which is sampled from the real world. As a result, they cannot fully exploit the VLM to achieve optimal performance on the given VQA question. To overcome this limitation, and inspired by the human top-down reasoning process, i.e., systematically exploring relevant issues to derive a comprehensive answer, this work introduces a novel, explainable multi-agent collaboration framework that leverages the expansive knowledge of LLMs to enhance the capabilities of VLMs themselves. Specifically, our framework comprises three agents, i.e., Responder, Seeker, and Integrator, which collaboratively answer the given VQA question by seeking its relevant issues and generating the final answer through this top-down reasoning process. The VLM-based Responder agent generates answer candidates for the question and responds to the other relevant issues. The Seeker agent, primarily based on an LLM, identifies issues relevant to the question to inform the Responder agent and constructs a Multi-View Knowledge Base (MVKB) for the given visual scene by leveraging the built-in world knowledge of the LLM. The Integrator agent combines knowledge from the Seeker agent and the Responder agent to produce the final VQA answer. Extensive evaluations on diverse VQA datasets with a variety of VLMs demonstrate the superior performance and interpretability of our framework over the baseline method in the zero-shot setting, without extra training cost.
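The interaction among the three agents can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names, the prompt wording, the `VLM` and `LLM` callable interfaces, and the way the MVKB is assembled from issue/answer pairs are all assumptions made for illustration; only the division of roles (Responder, Seeker, Integrator) comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interfaces (assumptions, not from the paper):
# a VLM maps (image, question) to an answer string, an LLM maps a prompt to text.
VLM = Callable[[str, str], str]
LLM = Callable[[str], str]


@dataclass
class Responder:
    """VLM-based agent: answers the question and any relevant issues."""
    vlm: VLM

    def answer(self, image: str, question: str) -> str:
        return self.vlm(image, question)


@dataclass
class Seeker:
    """LLM-based agent: proposes relevant issues and builds the MVKB."""
    llm: LLM

    def relevant_issues(self, question: str) -> List[str]:
        raw = self.llm(f"List sub-questions relevant to: {question}")
        # One issue per non-empty line of the LLM output (assumed format).
        return [line.strip() for line in raw.splitlines() if line.strip()]

    def build_mvkb(self, issue_answers: Dict[str, str]) -> List[str]:
        # Assumed construction: turn each issue/answer pair into a statement.
        return [
            self.llm(f"State the fact implied by Q: {q} A: {a}")
            for q, a in issue_answers.items()
        ]


@dataclass
class Integrator:
    """Combines the candidate answer and the MVKB into a final answer."""
    llm: LLM

    def integrate(self, question: str, candidate: str, mvkb: List[str]) -> str:
        facts = "\n".join(mvkb)
        prompt = (
            f"Question: {question}\nCandidate: {candidate}\n"
            f"Knowledge:\n{facts}\nFinal answer:"
        )
        return self.llm(prompt)


def answer_vqa(image: str, question: str, responder: Responder,
               seeker: Seeker, integrator: Integrator) -> Dict[str, object]:
    """Run one top-down pass and keep intermediate steps for inspection."""
    candidate = responder.answer(image, question)
    issues = seeker.relevant_issues(question)
    issue_answers = {q: responder.answer(image, q) for q in issues}
    mvkb = seeker.build_mvkb(issue_answers)
    final = integrator.integrate(question, candidate, mvkb)
    return {"candidate": candidate, "issues": issues, "mvkb": mvkb, "final": final}
```

Returning the intermediate issues and MVKB entries alongside the final answer is one simple way to expose the reasoning trace, which is the sense in which such a pipeline can be inspected; how the actual framework surfaces its explanations is not specified in the abstract.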

MERGE: Multi-Entity Relational Reasoning Based Explanation in Visual Question Answering

Entity-Relation Extraction As Multi-Turn Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Improving VQA and its Explanations by Comparing Competing Explanations

Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network

R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering

A Study on Multimodal and Interactive Explanations for Visual Question Answering

Context-aware Multi-level Question Embedding Fusion for visual question answering

Towards Reasoning-Aware Explainable VQA

Multi-scale Relation Reasoning for Multi-Modal Visual Question Answering

Relational reasoning and adaptive fusion for visual question answering

Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Joint Answering and Explanation for Visual Commonsense Reasoning

The Impact of Explanations on AI Competency Prediction in VQA

An effective spatial relational reasoning networks for visual question answering

Robust Explanations for Visual Question Answering

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering