Abstract: With the breakthrough of multi-modal large language models, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become a much more important testbed for developing AI models than ever. However, equipping AI models with robust cross-modality reasoning ability remains challenging since the cognition scheme of humans has not been understood systematically. In this paper, we believe that if we can collect visual clues in the given image as much as possible, we will recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these rich visual clues by mining question-answer pairs in images and sending them into multi-modal large language models as prompts. We call the proposed method Q&A Prompts. Specifically, we first use the image-answer pairs and the corresponding questions in the training set as inputs and outputs to train a visual question generation model. Then, we use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal large language models to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, our Q&A Prompts achieves substantial improvements on the challenging visual question answering datasets requiring reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.
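The first two steps in the abstract amount to a role swap in the data: during training the visual question generation (VQG) model receives an image and an answer and must produce the question, and at inference time each extracted image tag takes the place of the answer. The sketch below illustrates only that data reshaping. The record types, field names, and the tag de-duplication step are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VQAExample:
    """One VQA training record (hypothetical schema)."""
    image_id: str
    question: str
    answer: str


@dataclass(frozen=True)
class VQGExample:
    """One VQG record: the model sees (image, answer) and must produce the question."""
    image_id: str
    answer: str           # conditioning input
    target_question: str  # training target; empty at inference time


def to_vqg_training_set(vqa_examples: List[VQAExample]) -> List[VQGExample]:
    """Swap roles: the VQA answer becomes an input, the question becomes the target."""
    return [VQGExample(ex.image_id, ex.answer, ex.question) for ex in vqa_examples]


def to_vqg_inference_inputs(image_id: str, tags: List[str]) -> List[VQGExample]:
    """Pair the image with each extracted tag; the tag plays the role of the answer.

    Normalizing and de-duplicating tags is an assumption, not a step from the paper.
    """
    seen = set()
    inputs = []
    for tag in tags:
        key = tag.strip().lower()
        if key and key not in seen:  # skip blank and repeated tags
            seen.add(key)
            inputs.append(VQGExample(image_id, key, ""))
    return inputs
```

A trained VQG model would then fill in `target_question` for each inference record, yielding the question-answer pairs that are later used as prompts.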

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses how Multimodal Large Language Models (MLLMs) can perform complex reasoning and apply world knowledge in Visual Question Answering (VQA). It focuses on questions that require diverse world knowledge and multi-step reasoning, such as those in the OK-VQA and A-OKVQA datasets. Traditional VQA methods perform well on simple perception tasks, but existing methods, including state-of-the-art MLLMs, often perform poorly on questions of this kind.

The paper proposes a method called Q&A Prompts, which mines rich visual cues from an image in the form of question-answer pairs and feeds these pairs to an MLLM as prompts to strengthen its reasoning.

### Main Contributions

1. **Proposed a new VQA framework**: The framework enhances the reasoning capabilities of multimodal large language models by generating question-answer pairs and using them as prompts. It explicitly collects rich visual cues, bridging the logical gap between perception and reasoning.
2. **Designed a new question-answer prompt generation scheme**: The scheme combines a Visual Question Generation (VQG) model and an image tagging model to generate Q&A prompts about recognizable objects, scenes, and activities in the image.
3. **Introduced a new visual-aware prompting module**: The module encodes the generated Q&A prompts for use in the subsequent reasoning process.
4. **Experimental validation**: Experiments on the challenging OK-VQA and A-OKVQA datasets show that Q&A Prompts improves the reasoning capabilities of MLLMs, with models such as InstructBLIP, LLaVA, and MiniGPT-4 showing notable performance gains on these datasets.

### Experimental Results

- On the A-OKVQA dataset, Q&A Prompts achieved an accuracy of 69.4% on the validation set and 68.1% on the test set.
- On the OK-VQA dataset, Q&A Prompts achieved an accuracy of 64.3%.
- Compared to existing state-of-the-art methods, Q&A Prompts improved by 5.4% (validation) and 6.0% (test) on A-OKVQA, and by 2.2% on OK-VQA.

### Examples

The paper provides several specific examples demonstrating how Q&A Prompts help the model better understand images and questions to arrive at the correct answers. For instance:

- **Question**: In which room of the house is this man?
- **Image Tags**: mirror, sink, reflection, tiled wall, tie, bathroom accessories, gaze, faucet, cabinet
- **Q&A Prompts**:
  - Q: What is this man looking at? A: Mirror
  - Q: Where does this man wash his hands? A: Sink
  - Q: Where are the bathroom accessories placed? A: Cabinet
- **Predicted Answers**:
  - InstructBLIP: Bedroom
  - BLIP-2: Bedroom
  - Q&A Prompts: Bathroom
- **Ground Truth**: Bathroom

These examples illustrate that Q&A Prompts can improve the model's performance on complex VQA tasks by surfacing visual clues the model would otherwise overlook.
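To make the bathroom example concrete, the sketch below assembles its three question-answer pairs and the target question into a single text prompt. This is a plain-text stand-in only: according to the abstract, the paper encodes the pairs with a visual-aware prompting module rather than concatenating strings. The function name `format_qa_prompt` and the prompt template are hypothetical.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (generated question, image tag used as its answer)


def format_qa_prompt(question: str, qa_pairs: List[QAPair]) -> str:
    """Render generated Q&A pairs as context lines ahead of the target question.

    The template is illustrative; the paper uses a learned prompt encoder instead.
    """
    lines = ["Visual clues:"]
    for generated_question, tag_answer in qa_pairs:
        lines.append(f"Q: {generated_question} A: {tag_answer}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Q&A pairs copied from the bathroom example.
example_pairs: List[QAPair] = [
    ("What is this man looking at?", "Mirror"),
    ("Where does this man wash his hands?", "Sink"),
    ("Where are the bathroom accessories placed?", "Cabinet"),
]

prompt = format_qa_prompt("In which room of the house is this man?", example_pairs)
print(prompt)
```

The intuition matches the example: once "Mirror", "Sink", and "Cabinet" appear explicitly in the context, the answer "Bathroom" becomes much easier to infer than from the question alone.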

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
