Abstract: Zero-shot Visual Question Answering (VQA) is a prominent vision-language task that examines both the visual and textual understanding capabilities of systems in the absence of training data. Recently, converting images into captions has bridged information across modalities, allowing Large Language Models (LLMs) to apply their strong zero-shot generalization capability to unseen questions. To design ideal prompts for solving VQA via LLMs, several studies have explored different strategies to select or generate question-answer pairs as exemplar prompts, which guide LLMs to answer the current questions effectively. However, they ignore the role of question prompts. The original questions in VQA tasks often contain ellipsis and ambiguity, which require intermediate reasoning to resolve. To this end, we present Reasoning Question Prompts for VQA tasks, which can further activate the potential of LLMs in zero-shot scenarios. Specifically, for each question, we first generate self-contained questions as reasoning question prompts via an unsupervised question edition module that considers sentence fluency, semantic integrity and syntactic invariance. Each reasoning question prompt clearly indicates the intent of the original question, and answering these prompts yields a set of candidate answers. Then, the candidate answers, together with their confidence scores, act as answer heuristics and are fed into LLMs to produce the final answer. We evaluate reasoning question prompts on three VQA challenges. Experimental results demonstrate that they significantly improve the results of LLMs in the zero-shot setting and outperform existing state-of-the-art zero-shot methods on three out of four data sets. Our source code is publicly released at https://github.com/ECNU-DASE-NLP/RQP.
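The last stage described in the abstract, where candidate answers and their confidence scores are passed to the LLM as answer heuristics, can be illustrated with a minimal Python sketch. This is not the released RQP implementation: the function names `aggregate_candidates` and `build_final_prompt`, the rule of summing confidences across reasoning question prompts, and the prompt layout are all assumptions made for illustration, and the captioning and LLM calls are left out.

```python
from collections import defaultdict


def aggregate_candidates(per_prompt_answers):
    """Merge (answer, confidence) pairs gathered from several reasoning question prompts.

    Assumption (not from the paper): confidences of identical answers, compared
    case-insensitively, are summed and then normalized to sum to 1.
    """
    scores = defaultdict(float)
    for answers in per_prompt_answers:
        for answer, confidence in answers:
            scores[answer.strip().lower()] += confidence
    total = sum(scores.values())
    if total <= 0:
        return []
    # Highest aggregated confidence first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(answer, score / total) for answer, score in ranked]


def build_final_prompt(caption, question, candidates, top_k=3):
    """Format a hypothetical final prompt that lists candidates as answer heuristics."""
    heuristics = ", ".join(
        f"{answer} ({score:.2f})" for answer, score in candidates[:top_k]
    )
    return "\n".join(
        [
            f"Context: {caption}",
            f"Question: {question}",
            f"Candidates: {heuristics}",
            "Answer:",
        ]
    )


if __name__ == "__main__":
    # Illustrative candidate answers for two reasoning question prompts.
    per_prompt = [
        [("Frisbee", 0.6), ("ball", 0.2)],
        [("frisbee", 0.5), ("kite", 0.1)],
    ]
    candidates = aggregate_candidates(per_prompt)
    print(build_final_prompt("A dog jumps in a park.", "What is it catching?", candidates))
```

The string returned by `build_final_prompt` would then be sent to whichever LLM is in use. Summing confidences is only one way to combine candidates, and the paper may weight or select them differently.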

A Region-based Document VQA

Simple and Effective Visual Question Answering in a Single Modality

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Towards Complex Document Understanding by Discrete Reasoning

PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering

PDFVQA: A New Dataset for Real-World VQA on PDF Documents

Beyond OCR + VQA: Towards End-to-End Reading and Reasoning for Robust and Accurate TextVQA

A Survey on VQA: Datasets and Approaches

Weakly-Supervised 3D Spatial Reasoning for Text-based Visual Question Answering

Convincing Rationales for Visual Question Answering Reasoning

R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering

DuReader_vis: A Chinese Dataset for Open-domain Document Visual Question Answering

REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

An effective spatial relational reasoning networks for visual question answering

Joint Answering and Explanation for Visual Commonsense Reasoning

Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts