Abstract: With the breakthrough of multi-modal large language models, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become a more important testbed than ever for developing AI models. However, equipping AI models with robust cross-modality reasoning ability remains challenging, since the human cognition scheme is not yet systematically understood. In this paper, we believe that if we can collect as many visual clues as possible from the given image, we will recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these rich visual clues by mining question-answer pairs in images and sending them into multi-modal large language models as prompts. We call the proposed method Q&A Prompts. Specifically, we first train a visual question generation model, using the image-answer pairs in the training set as inputs and the corresponding questions as outputs. Then, we use an image tagging model to identify various instances in the image and send the packaged image-tag pairs into the visual question generation model, which generates relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal large language models to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, Q&A Prompts achieves substantial improvements on challenging visual question answering datasets that require reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.
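The mining step described in the abstract (tag the image, then generate one question per tag with the tag as its answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `QAPair`, `mine_qa_pairs`, `format_prompt`, and the toy tagger and question generator are all hypothetical names, and the plain-text prompt is only a stand-in for the learned visual-aware prompting module, whose details the abstract does not give.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


def mine_qa_pairs(
    image: Any,
    tagger: Callable[[Any], Sequence[str]],
    question_generator: Callable[[Any, str], str],
    max_pairs: int = 8,
) -> List[QAPair]:
    """Generate one question per distinct image tag, using the tag as the answer."""
    pairs: List[QAPair] = []
    seen = set()
    for tag in tagger(image):
        if tag in seen:  # skip duplicate tags
            continue
        seen.add(tag)
        pairs.append(QAPair(question=question_generator(image, tag), answer=tag))
        if len(pairs) >= max_pairs:  # assumed cap, not from the paper
            break
    return pairs


def format_prompt(pairs: Sequence[QAPair], target_question: str) -> str:
    """Plain-text stand-in for the prompting module: mined pairs, then the target question."""
    lines = [f"Question: {p.question} Answer: {p.answer}" for p in pairs]
    lines.append(f"Question: {target_question} Answer:")
    return "\n".join(lines)


# Toy stand-ins for the image tagging model and the trained question generation model.
def toy_tagger(image: Any) -> List[str]:
    return ["dog", "frisbee", "dog"]


def toy_question_generator(image: Any, answer: str) -> str:
    return f"Which object in the image is the {answer}?"


pairs = mine_qa_pairs("image-placeholder", toy_tagger, toy_question_generator)
prompt = format_prompt(pairs, "What game is being played?")
```

In a real system the two callables would wrap an image tagging model and the question generation model trained on image-answer pairs, and the mined pairs would be encoded jointly with visual features rather than concatenated as text.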

Diverse Visual Question Generation based on Multiple Objects Selection

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Visual Question Generation Under Multi-granularity Cross-Modal Interaction

A Question Type Driven Framework to Diversify Visual Question Generation

Generating Natural Questions from Images for Multimodal Assistants

ConVQG: Contrastive Visual Question Generation with Multimodal Guidance

Exploring Diverse Methods in Visual Question Answering

Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

Learning to Generate Visual Questions with Noisy Supervision

Question-Guided Semantic Dual-Graph Visual Reasoning with Novel Answers

Diversifying Question Generation over Knowledge Base via External Natural Questions

Information Maximizing Visual Question Generation

Visual question answering: A survey of methods and datasets

Multi-Question Learning for Visual Question Answering

VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization

Diverse and Specific Clarification Question Generation with Keywords

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

Visual Question Generation for Class Acquisition of Unknown Objects

Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation