Abstract: Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub reasoning steps and is highly interpretable. We convert sub reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines.

What problem does this paper attempt to address?

This paper addresses how to use large-scale pre-trained models (PTMs) for visual question answering (VQA) in a zero-shot setting. Specifically, the authors aim to overcome the following challenges:

1. **The need for multi-step reasoning**: Many VQA questions require multiple reasoning steps to arrive at an answer, while most existing pre-trained models lack this ability.
2. **Different reasoning steps require different skills**: Steps in a reasoning chain call for skills such as object detection and relational reasoning, but a single pre-trained model may not possess all of them.
3. **Limitations of existing zero-shot VQA methods**: Current zero-shot VQA methods do not explicitly consider multi-step reasoning chains, which makes them less interpretable.

To solve these problems, the authors propose a modular zero-shot network (Mod-Zero-VQA), which explicitly decomposes the question into sub-reasoning steps and assigns these subtasks to appropriate pre-trained models, thereby achieving better performance and higher interpretability.

### Main contributions of the paper:

1. **Propose a novel modular zero-shot VQA method** that uses different pre-trained models to handle different reasoning steps.
2. **Design rules** to map different VQA reasoning steps to suitable pre-trained models, without the need for any adaptation of these models.
3. **Experimental results show** that the proposed method significantly outperforms the baseline model on questions requiring multi-step reasoning, with a relative improvement of nearly 13% in accuracy (from 41.9 to 47.3).
4. **Improve the interpretability of the model** by generating explicit reasoning steps, making each reasoning step clearly visible.

### Method overview:

- **Modular decomposition**: Decompose complex VQA questions into multiple basic reasoning steps.
- **Task assignment**: Map these reasoning steps to different pre-trained models (such as OWL, MDETR, and CLIP) according to predefined rules.
- **Spatial heuristics**: Introduce simple spatial heuristic rules to assist pre-trained models in spatial relational reasoning.

This method not only improves the performance of zero-shot VQA, but also enhances the interpretability and flexibility of the model.
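
As a rough illustration of the modular decomposition and task assignment described above, the following Python sketch pairs each reasoning step with a named PTM and applies a simple box-center rule for a "left of" relation. The step types (`select`, `relate`, `query`), the `Step` schema, the routing table, and the center-comparison heuristic are all assumptions made for illustration. The summary only states that steps are mapped to models such as OWL, MDETR, and CLIP by predefined rules; it does not say which step goes to which model. No real model is called here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Bounding box as (x_min, y_min, x_max, y_max); image x grows to the right.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Step:
    """One sub-reasoning step of a decomposed question (hypothetical schema)."""
    op: str   # kind of reasoning step, e.g. "select"
    arg: str  # text argument handed to the chosen model


# ASSUMPTION: this op-to-model mapping is illustrative only, not the paper's rules.
ROUTING: Dict[str, str] = {
    "select": "OWL",    # open-vocabulary object detection
    "relate": "MDETR",  # text-conditioned grounding of a related object
    "query": "CLIP",    # image-text matching over candidate answers
}


def assign(steps: List[Step]) -> List[Tuple[Step, str]]:
    """Pair each step with the name of the PTM that would handle it."""
    assigned = []
    for step in steps:
        if step.op not in ROUTING:
            raise ValueError(f"no model registered for step type {step.op!r}")
        assigned.append((step, ROUTING[step.op]))
    return assigned


def center_x(box: Box) -> float:
    """Horizontal center of a bounding box."""
    x_min, _, x_max, _ = box
    return (x_min + x_max) / 2


def left_of(a: Box, b: Box) -> bool:
    """ASSUMED spatial heuristic: a is left of b if its center has a smaller x."""
    return center_x(a) < center_x(b)


# Hypothetical decomposition of "What color is the cup to the left of the laptop?"
plan = [
    Step("select", "laptop"),
    Step("relate", "cup to the left of"),
    Step("query", "color"),
]

for step, model in assign(plan):
    print(f"{step.op}({step.arg}) -> {model}")
```

Printing the assigned plan is one simple way to surface the explicit reasoning steps that the summary credits for interpretability; a real system would replace the model names with calls to the corresponding PTMs.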

Modularized Zero-shot VQA with Pre-trained Models

Simple and Effective Visual Question Answering in a Single Modality

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Exploring Question Decomposition for Zero-Shot VQA

Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering

Zero-Shot Visual Question Answering Using Knowledge Graph

Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Zero-Shot Cross-Lingual Knowledge Transfer in VQA Via Multimodal Distillation

Zero-Shot Transfer VQA Dataset

Good Questions Help Zero-Shot Image Reasoning

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Zero-shot Visual Question Answering with Language Model Feedback

Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

Modular Visual Question Answering via Code Generation

Question Guided Modular Routing Networks for Visual Question Answering

Multitask Learning for Visual Question Answering

LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering

Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language