Abstract: With the breakthrough of multi-modal large language models, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become a much more important testbed for developing AI models than ever. However, equipping AI models with robust cross-modality reasoning ability remains challenging since the cognition scheme of humans has not been understood systematically. In this paper, we believe that if we can collect visual clues in the given image as much as possible, we will recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these rich visual clues by mining question-answer pairs in images and sending them into multi-modal large language models as prompts. We call the proposed method Q&A Prompts. Specifically, we first use the image-answer pairs and the corresponding questions in the training set as inputs and outputs to train a visual question generation model. Then, we use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal large language models to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, our Q&A Prompts achieves substantial improvements on the challenging visual question answering datasets requiring reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
### The Problem the Paper Attempts to Solve
This paper addresses how Multimodal Large Language Models (MLLMs) can perform complex reasoning and apply world knowledge in Visual Question Answering (VQA). It focuses on questions that require diverse world knowledge and multi-step reasoning, such as those in the OK-VQA and A-OKVQA datasets.
Traditional VQA methods perform well on simple perception tasks, but existing methods, including state-of-the-art MLLMs, often perform poorly on questions that require diverse world knowledge and complex logical reasoning. The paper proposes a method called Q&A Prompts, which mines rich visual cues from an image in the form of question-answer pairs and feeds these pairs to an MLLM as prompts to strengthen its reasoning.
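The data flow described above (tag the image, then generate one question per tag with the tag as its answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `QAPair`, `generate_qa_prompts`, `toy_tagger`, and `toy_question_generator` are hypothetical names, and the two toy functions merely stand in for a real image tagging model and a trained visual question generation model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str


def generate_qa_prompts(
    image: object,
    tagger: Callable[[object], List[str]],
    question_generator: Callable[[object, str], str],
) -> List[QAPair]:
    """Build one question-answer pair per image tag.

    Both callables are hypothetical stand-ins: `tagger` plays the role of an
    image tagging model, and `question_generator` plays the role of a VQG
    model that maps (image, answer) to a question.
    """
    pairs = []
    for tag in tagger(image):
        # The tag is treated as the answer; the generator proposes a question for it.
        question = question_generator(image, tag)
        pairs.append(QAPair(question=question, answer=tag))
    return pairs


# Toy stand-ins so the sketch runs without any model weights (assumptions).
def toy_tagger(image):
    return ["mirror", "sink"]


def toy_question_generator(image, answer):
    return f"Which object in the image is the {answer}?"


pairs = generate_qa_prompts(
    image=None,
    tagger=toy_tagger,
    question_generator=toy_question_generator,
)
for pair in pairs:
    print(f"Q: {pair.question} A: {pair.answer}")
```

Passing the models in as callables keeps the sketch independent of any particular tagging or generation library; in the paper, the resulting pairs are then encoded by a prompting module rather than printed.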
### Main Contributions
1. **Proposed a new VQA framework**: This framework effectively enhances the reasoning capabilities of multimodal large language models by generating and utilizing question-answer pairs as prompts. This method explicitly collects rich visual cues, bridging the logical gap between perception and reasoning.
2. **Designed a new question-answer prompt generation scheme**: This scheme combines a Visual Question Generation (VQG) model and an image tagging model to generate Q&A prompts related to recognizable objects, scenes, and activities in the image.
3. **Introduced a new visual-aware prompting module**: This module efficiently encodes the generated question-answer pairs as prompts for the MLLM's subsequent reasoning.
4. **Experimental validation**: Experiments were conducted on the more challenging OK-VQA and A-OKVQA datasets, showing that Q&A Prompts significantly improved the reasoning capabilities of MLLMs, with models like InstructBLIP, LLaVA, and MiniGPT-4 showing notable performance improvements on these datasets.
### Experimental Results
- On the A-OKVQA dataset, Q&A Prompts achieved an accuracy of 69.4% on the validation set and 68.1% on the test set.
- On the OK-VQA dataset, Q&A Prompts achieved an accuracy of 64.3%.
- Compared with existing state-of-the-art methods, Q&A Prompts improved accuracy on A-OKVQA by 5.4% and 6.0% (validation and test sets, respectively) and on OK-VQA by 2.2%.
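As a quick arithmetic check, the figures above imply the accuracy of the prior state of the art on each split. The sketch below assumes that the reported gains are absolute percentage points and that the two A-OKVQA gains are listed in validation-then-test order; neither assumption is stated explicitly in this summary, and `implied_baseline` is a hypothetical helper name.

```python
# Reported (accuracy, gain) pairs in percent, taken from the bullets above.
# Assumption: gains are absolute points, and 5.4 / 6.0 map to val / test.
reported = {
    "A-OKVQA val": (69.4, 5.4),
    "A-OKVQA test": (68.1, 6.0),
    "OK-VQA": (64.3, 2.2),
}


def implied_baseline(accuracy, gain):
    """Prior best accuracy implied by a reported accuracy and absolute gain."""
    # Round to one decimal to avoid floating-point noise such as 62.0999...
    return round(accuracy - gain, 1)


baselines = {
    name: implied_baseline(acc, gain) for name, (acc, gain) in reported.items()
}
for name, value in baselines.items():
    print(f"{name}: implied previous best = {value}%")
```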
### Examples
The paper provides several specific examples demonstrating how Q&A Prompts help the model better understand images and questions to arrive at the correct answers. For instance:
- **Question**: In which room of the house is this man?
- **Image Tags**: mirror, sink, reflection, tiled wall, tie, bathroom accessories, gaze, faucet, cabinet
- **Q&A Prompts**:
  - Q: What is this man looking at? A: Mirror
  - Q: Where does this man wash his hands? A: Sink
  - Q: Where are the bathroom accessories placed? A: Cabinet
- **Predicted Answers**:
  - InstructBLIP: Bedroom
  - BLIP-2: Bedroom
  - Our method: Bathroom
- **Ground Truth**: Bathroom
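In this example, the generated question-answer pairs surface cues such as the mirror and the sink, which point to a bathroom, whereas InstructBLIP and BLIP-2 both answer "Bedroom". The paper presents such cases as evidence that Q&A Prompts improve performance on complex VQA tasks.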
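To make the example concrete, the sketch below serializes the three question-answer pairs and the target question into a single text prompt. This is a deliberately simplified, text-only illustration: the paper encodes the pairs with a visual-aware prompting module instead of concatenating plain text, and the function name `build_text_prompt` and the `Context: ... Question: ... Answer:` template are assumptions made here for illustration.

```python
def build_text_prompt(question, qa_pairs):
    """Concatenate auxiliary Q&A pairs and the target question into one string.

    Hypothetical template; the paper's prompting module works on encoded
    representations rather than on a plain-text string like this one.
    """
    context = " ".join(f"Q: {q} A: {a}." for q, a in qa_pairs)
    return f"Context: {context} Question: {question} Answer:"


# Q&A pairs and target question copied from the example above.
qa_pairs = [
    ("What is this man looking at?", "Mirror"),
    ("Where does this man wash his hands?", "Sink"),
    ("Where are the bathroom accessories placed?", "Cabinet"),
]
prompt = build_text_prompt("In which room of the house is this man?", qa_pairs)
print(prompt)
```

Placing the auxiliary pairs before the target question lets a model read the mined cues first; whether this ordering matters for a given MLLM is not something this summary establishes.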