Modularized Zero-shot VQA with Pre-trained Models

Rui Cao, Jing Jiang
DOI: https://doi.org/10.48550/arXiv.2305.17369
2024-01-24
Abstract: Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub reasoning steps and is highly interpretable. We convert sub reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses how to use large-scale pre-trained models (PTMs) for visual question answering (VQA) in a zero-shot setting. Specifically, the authors aim to overcome the following challenges:

1. **The need for multi-step reasoning**: Many VQA questions require multiple reasoning steps to arrive at an answer, and most existing pre-trained models lack this ability.
2. **Different reasoning steps require different skills**: Steps may call for skills such as object detection and relational reasoning, but a single pre-trained model may not possess all of them.
3. **Limitations of existing zero-shot VQA methods**: Current zero-shot VQA methods do not explicitly consider multi-step reasoning chains, which makes them less interpretable.

To address these challenges, the authors propose a modular zero-shot network (Mod-Zero-VQA) that explicitly decomposes the question into sub-reasoning steps and assigns these subtasks to appropriate pre-trained models, thereby achieving better performance and higher interpretability.

### Main contributions of the paper:

1. **Propose a novel modular zero-shot VQA method** that uses different pre-trained models to handle different reasoning steps.
2. **Design rules** to map different VQA reasoning steps to suitable pre-trained models, without the need for any adaptation of these models.
3. **Experimental results show** that the proposed method significantly outperforms the baseline model on questions requiring multi-step reasoning, with a relative improvement of nearly 13% in accuracy (from 41.9 to 47.3).
4. **Improve the interpretability of the model** by generating explicit reasoning steps, so that each step is clearly visible.

### Method overview:

- **Modular decomposition**: Decompose complex VQA questions into multiple basic reasoning steps.
- **Task assignment**: Map these reasoning steps to different pre-trained models (such as OWL, MDETR, and CLIP) according to predefined rules.
- **Spatial heuristics**: Introduce simple spatial heuristic rules to assist pre-trained models in spatial relational reasoning.

This method not only improves the performance of zero-shot VQA, but also enhances the interpretability and flexibility of the model.
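
The decomposition, rule-based assignment, and spatial heuristic described above can be illustrated with a small sketch. It is not the authors' implementation: the step types, the `ROUTING` table, the module names, and the `left_of` rule are all hypothetical choices made for illustration, and no real model is called. The summary names OWL, MDETR, and CLIP but does not say which model handles which step, so the comments only indicate the kind of model that could plausibly fill each role.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Step:
    op: str   # step type, e.g. "select", "relate", "query"
    arg: str  # textual argument, e.g. an object name or a relation


# Hypothetical routing table (an assumption, not the paper's actual rules).
ROUTING: Dict[str, str] = {
    "select": "open_vocab_detector",  # role an OWL-style detector could fill
    "relate": "spatial_heuristic",    # simple geometric rule, no model call
    "query": "image_text_matcher",    # role a CLIP-style model could fill
}


def route(step: Step) -> str:
    """Return the name of the module responsible for a reasoning step."""
    if step.op not in ROUTING:
        raise ValueError(f"no module registered for step type {step.op!r}")
    return ROUTING[step.op]


def center_x(box: Box) -> float:
    """Horizontal centre of a bounding box."""
    return (box[0] + box[2]) / 2.0


def left_of(candidates: List[Box], anchor: Box) -> List[Box]:
    """Keep candidate boxes whose horizontal centre lies left of the anchor's."""
    return [b for b in candidates if center_x(b) < center_x(anchor)]


# Hand-written decomposition of
# "What color is the cup to the left of the laptop?"
program = [
    Step("select", "laptop"),
    Step("relate", "cup, left of"),
    Step("query", "color"),
]

# The printed plan is the interpretable trace: one line per reasoning step.
plan = [(s.op, route(s)) for s in program]
for op, module in plan:
    print(f"{op:>6} -> {module}")

# Made-up boxes standing in for detector output.
laptop = (200.0, 100.0, 400.0, 300.0)
cups = [(50.0, 150.0, 100.0, 220.0), (450.0, 150.0, 500.0, 220.0)]
print(left_of(cups, laptop))
```

Keeping the routing in a plain lookup table reflects the idea that steps are assigned by predefined rules rather than learned, and that swapping one pre-trained model for another would only require changing an entry in the table.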