Abstract: Complex visual reasoning and question answering (VQA) is a challenging task that requires compositional multi-step processing and higher-level reasoning capabilities beyond the immediate recognition and localization of objects and events. Here, we introduce a fully neural Iterative and Parallel Reasoning Mechanism (IPRM) that combines two distinct forms of computation -- iterative and parallel -- to better address complex VQA scenarios. Specifically, IPRM's "iterative" computation facilitates compositional step-by-step reasoning for scenarios wherein individual operations need to be computed, stored, and recalled dynamically (e.g. when computing the query "determine the color of pen to the left of the child in red t-shirt sitting at the white table"). Meanwhile, its "parallel" computation allows for the simultaneous exploration of different reasoning paths and supports more robust and efficient execution of operations that are mutually independent (e.g. when counting individual colors for the query "determine the maximum occurring color amongst all t-shirts"). We design IPRM as a lightweight and fully differentiable neural module that can be conveniently applied to both transformer and non-transformer vision-language backbones. It outperforms prior task-specific methods and transformer-based attention modules across various image and video VQA benchmarks that test distinct complex reasoning capabilities, such as compositional spatiotemporal reasoning (AGQA), situational reasoning (STAR), multi-hop reasoning generalization (CLEVR-Humans) and causal event linking (CLEVRER-Humans). Further, IPRM's internal computations can be visualized across reasoning steps, aiding interpretability and diagnosis of its errors.
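
The distinction between the two forms of computation can be illustrated with a small symbolic sketch built on the abstract's two example queries. This is not IPRM itself, which is a neural module operating on learned features; the toy scene, its attribute names (`kind`, `shirt`, `at`, `x`), and both helper functions are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy scene (not from the paper): each object is a dict of attributes.
SCENE = [
    {"id": 0, "kind": "table", "color": "white", "x": 5},
    {"id": 1, "kind": "child", "shirt": "red", "x": 5, "at": 0},
    {"id": 2, "kind": "pen", "color": "blue", "x": 3},
    {"id": 3, "kind": "pen", "color": "green", "x": 8},
    {"id": 4, "kind": "child", "shirt": "red", "x": 9, "at": None},
    {"id": 5, "kind": "child", "shirt": "blue", "x": 1, "at": None},
]


def iterative_query(scene):
    """Color of the pen left of the child in a red t-shirt at the white table.

    Each step depends on a result stored by the previous step, so the
    steps have to run one after another.
    """
    # Step 1: locate the white table and store it.
    table = next(o for o in scene
                 if o["kind"] == "table" and o["color"] == "white")
    # Step 2: recall the table to find the red-shirted child sitting at it.
    child = next(o for o in scene
                 if o["kind"] == "child"
                 and o.get("shirt") == "red"
                 and o.get("at") == table["id"])
    # Step 3: recall the child to find a pen to its left (smaller x).
    pen = next(o for o in scene if o["kind"] == "pen" and o["x"] < child["x"])
    return pen["color"]


def count_color(scene, color):
    """Count t-shirts of one color; independent of every other color's count."""
    return sum(1 for o in scene if o.get("shirt") == color)


def parallel_query(scene):
    """Most frequent t-shirt color.

    The per-color counts do not depend on each other, so they can be
    computed at the same time and combined only at the end.
    """
    colors = sorted({o["shirt"] for o in scene if "shirt" in o})
    with ThreadPoolExecutor() as pool:
        counts = list(pool.map(lambda c: count_color(scene, c), colors))
    # Pair each count with its color and take the largest.
    return max(zip(counts, colors))[1]


if __name__ == "__main__":
    print(iterative_query(SCENE))
    print(parallel_query(SCENE))
```

In the first query no step can start before the previous one finishes, whereas in the second the per-color counts could run in any order or all at once; the abstract describes IPRM as combining both patterns within one module.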

ViperGPT: Visual Inference via Python Execution for Reasoning

Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments

Perceptual Visual Reasoning with Knowledge Propagation

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

Modular Visual Question Answering via Code Generation

MindGPT: Interpreting What You See with Non-invasive Brain Recordings

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering

Learning Visual Reasoning Without Strong Priors

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Recursive Visual Programming

Toward Accurate Visual Reasoning with Dual-Path Neural Module Networks

PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

Assessing GPT4-V on Structured Reasoning Tasks

VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Joint Answering and Explanation for Visual Commonsense Reasoning