Abstract: Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10 \times$ larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism

What problem does this paper attempt to address?

This paper explores how to decompose and separately evaluate the perception and reasoning abilities of Vision Language Models (VLMs). Existing VLMs intertwine perception (extracting information from images) and reasoning (generating answers from the extracted information) when answering visual questions, which makes it difficult to evaluate either capability on its own. To address this, the paper proposes a framework called Prism, which splits visual question solving into two independent stages: a perception stage, in which a VLM extracts image information and expresses it as text, and a reasoning stage, in which a large language model (LLM) generates an answer from that text. With Prism, researchers can test the perception and reasoning performance of different VLMs separately. The experimental results show that proprietary VLMs such as GPT-4V perform well in perception, while the perception capability of open-source VLMs depends little on the size of the language model, although their reasoning capability may be limited by its scale. The paper also finds that combining a small VLM (focused on perception) with a powerful LLM (focused on reasoning) can reduce training and running costs while maintaining performance on vision-language tasks. Prism is therefore not only an evaluation framework but also an efficient vision-language task solver. By training a VLM with approximately 2 billion parameters as a visual describer and pairing it with a powerful LLM, Prism achieves performance comparable to larger-scale VLMs with lower resource requirements. Prism outperforms many open-source VLMs on the multimodal benchmark MMStar, particularly on problems involving reasoning.
In conclusion, the main contribution of the paper is the proposal of the Prism framework, which allows for the decomposition of perception and reasoning and serves to evaluate the capabilities of existing VLMs, along with providing a strategy for optimizing visual language tasks.
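
The two-stage decomposition described above can be illustrated with a minimal Python sketch. This is not the Prism implementation: the names `TwoStagePipeline`, `Perceiver`, `Reasoner`, the instruction text, the prompt layout, and both stub functions are hypothetical and stand in for a real VLM and LLM. The sketch only shows how keeping one stage fixed while swapping the other lets each capability be compared in isolation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases (assumptions, not part of the Prism codebase).
Perceiver = Callable[[bytes, str], str]  # (image bytes, instruction) -> text description
Reasoner = Callable[[str], str]          # text prompt -> answer


@dataclass
class TwoStagePipeline:
    """Perception stage (VLM) followed by a text-only reasoning stage (LLM)."""
    perceive: Perceiver
    reason: Reasoner
    instruction: str = "Describe the image in detail."  # assumed instruction text

    def answer(self, image: bytes, question: str) -> str:
        # Stage 1: the VLM turns the image into a textual description.
        description = self.perceive(image, self.instruction)
        # Stage 2: the LLM sees only text, never the image itself.
        prompt = (
            f"Image description: {description}\n"
            f"Question: {question}\n"
            "Answer:"
        )
        return self.reason(prompt)


# Stubs standing in for real models so the sketch runs without any model access.
def detailed_vlm(image: bytes, instruction: str) -> str:
    return "A red cube sits on top of a blue cube."


def vague_vlm(image: bytes, instruction: str) -> str:
    return "Two cubes."


def stub_llm(prompt: str) -> str:
    # Toy reasoner: answers only if the description carries the needed fact.
    return "red" if "red cube sits on top" in prompt else "unknown"


question = "Which cube is on top?"
# Same reasoner, different perceivers: differences in the output are
# attributable to the perception stage alone.
answer_detailed = TwoStagePipeline(detailed_vlm, stub_llm).answer(b"", question)
answer_vague = TwoStagePipeline(vague_vlm, stub_llm).answer(b"", question)
print(answer_detailed, answer_vague)
```

Holding `stub_llm` fixed while swapping the perceiver mirrors, in toy form, how a perception comparison would work; holding the perceiver fixed and swapping the reasoner would isolate reasoning in the same way.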

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Visually Descriptive Language Model for Vector Graphics Reasoning

Smart Vision-Language Reasoners

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Large Language Models are Visual Reasoning Coordinators

Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration

Visually-Augmented Language Modeling

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models