Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

Aishik Nagar, Shantanu Jaiswal, Cheston Tan

2024-08-27

Abstract: Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks, alluding to their capabilities as visual reasoning engines. However, the benchmarks being used conflate "pure" visual reasoning with world knowledge, and also have questions that involve a limited number of reasoning steps. Thus, it remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge, or due to actual visual reasoning capabilities. To clarify this ambiguity, we systematically benchmark and dissect the zero-shot visual reasoning capabilities of VLMs through synthetic datasets that require minimal world knowledge, and allow for analysis over a broad range of reasoning steps. We focus on two novel aspects of zero-shot visual reasoning: i) evaluating the impact of conveying scene information as either visual embeddings or purely textual scene descriptions to the underlying large language model (LLM) of the VLM, and ii) comparing the effectiveness of chain-of-thought prompting to standard prompting for zero-shot visual reasoning. We find that the underlying LLMs, when provided textual scene descriptions, consistently perform better compared to being provided visual embeddings. In particular, 18% higher accuracy is achieved on the PTR dataset. We also find that CoT prompting performs marginally better than standard prompting only for the comparatively large GPT-3.5-Turbo (175B) model, and does worse for smaller-scale models. This suggests the emergence of CoT abilities for visual reasoning in LLMs at larger scales even when world knowledge is limited. Overall, we find limitations in the abilities of VLMs and LLMs for more complex visual reasoning, and highlight the important role that LLMs can play in visual reasoning.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

The paper examines how Vision-Language Models (VLMs) perform on zero-shot visual reasoning tasks and addresses three questions:

1. **Separating visual reasoning ability from world knowledge**: Current Visual Question Answering (VQA) benchmarks conflate pure visual reasoning with world knowledge, making it difficult to determine whether a VLM's performance stems from its world knowledge or from actual visual reasoning ability.
2. **Evaluating how scene information is represented**: The study compares the zero-shot visual reasoning performance obtained when scene information is conveyed to the VLM's underlying Large Language Model (LLM) as visual embeddings versus as purely textual scene descriptions.
3. **Comparing Chain-of-Thought (CoT) prompting with standard prompting**: The analysis measures how CoT prompting fares against standard prompting in zero-shot visual reasoning, particularly across models of different scales.

Using synthetic datasets (such as CLEVR and PTR), the authors systematically evaluate the zero-shot visual reasoning capabilities of VLMs and find that the underlying LLMs consistently perform better when given purely textual scene descriptions than when given visual embeddings. CoT prompting performs marginally better than standard prompting only for the comparatively large GPT-3.5-Turbo model and does worse for smaller-scale models. These findings clarify the limitations of VLMs and LLMs on more complex visual reasoning tasks and point to directions for improvement.
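To make the two comparisons concrete, the following minimal sketch shows how a purely textual scene description and the two prompt styles might be constructed. It is illustrative only: the attribute names (`size`, `color`, `material`, `shape`), the helper functions `describe_scene` and `build_prompt`, and the prompt layout are assumptions, not the authors' actual templates. The CoT trigger phrase "Let's think step by step" is the one popularized by zero-shot CoT work and may differ from what the paper used.

```python
from typing import Dict, List


def describe_scene(objects: List[Dict[str, str]]) -> str:
    """Render object attributes as a plain-text scene description.

    The attribute keys are hypothetical, loosely modelled on CLEVR-style metadata.
    """
    lines = []
    for i, obj in enumerate(objects, start=1):
        lines.append(
            f"Object {i}: a {obj['size']} {obj['color']} "
            f"{obj['material']} {obj['shape']}"
        )
    return "\n".join(lines)


def build_prompt(scene_text: str, question: str, use_cot: bool) -> str:
    """Build a standard or chain-of-thought prompt (assumed layout)."""
    # Standard prompting asks for the answer directly; CoT appends a
    # trigger phrase that invites intermediate reasoning steps.
    suffix = "Answer: Let's think step by step." if use_cot else "Answer:"
    return f"Scene:\n{scene_text}\n\nQuestion: {question}\n{suffix}"


# Made-up example scene, not taken from any dataset.
scene = [
    {"size": "large", "color": "red", "material": "metal", "shape": "cube"},
    {"size": "small", "color": "blue", "material": "rubber", "shape": "sphere"},
]
question = "How many metal objects are there?"

scene_text = describe_scene(scene)
standard_prompt = build_prompt(scene_text, question, use_cot=False)
cot_prompt = build_prompt(scene_text, question, use_cot=True)

print(cot_prompt)
```

Either prompt string would then be sent to the LLM under test. In the visual-embedding setting, the scene text would instead be replaced by image features fed through the VLM's own interface, which this sketch does not model.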
