Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
2024-06-21
Abstract: Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10 \times$ larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to separate and independently evaluate the perception and reasoning abilities of Vision Language Models (VLMs). When answering visual questions, existing VLMs intertwine perception (extracting information from the image) with reasoning (generating an answer from that information), which makes it difficult to assess either capability on its own. To address this, the paper proposes a framework called Prism, which splits visual question solving into two independent stages: a perception stage, in which a VLM extracts the image information and expresses it as text, and a reasoning stage, in which a Large Language Model (LLM) generates the answer from that text. With Prism, researchers can test the perception and the reasoning performance of different VLMs separately.
The experimental results show that proprietary VLMs such as GPT-4V perform well in perception, and that the perception capability of open-source VLMs depends little on the size of their language model, whereas their reasoning capability may be limited by that size. The paper also finds that combining a small VLM (focused on perception) with a powerful LLM (focused on reasoning) can reduce training and running costs while maintaining performance on vision-language tasks.
Prism is therefore not only an evaluation tool but also an efficient solver for vision-language tasks. By training a VLM with approximately 2 billion parameters as a visual captioner and pairing it with a powerful LLM, it achieves performance comparable to much larger VLMs with lower resource requirements. On the multimodal benchmark MMStar, Prism outperforms many open-source VLMs, particularly on questions that involve reasoning.
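The two-stage design described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`prism_answer`, `compare_perception`), the generic instruction text, the prompt wording, and the exact-match scoring are all assumptions made for illustration, and the "models" are toy stand-ins rather than real VLM or LLM calls. The sketch shows the key idea that the LLM only ever sees text, so holding the LLM fixed while swapping VLMs isolates differences in perception.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (not from the paper's code):
# a perception model maps (image bytes, instruction) -> text description,
# a reasoning model maps a text prompt -> text answer.
PerceptionFn = Callable[[bytes, str], str]
ReasoningFn = Callable[[str], str]
Sample = Tuple[bytes, str, str]  # (image, question, expected answer)

# Assumed wording; the actual instruction used by Prism may differ.
GENERIC_INSTRUCTION = "Describe the image in detail."


def prism_answer(image: bytes, question: str,
                 perceive: PerceptionFn, reason: ReasoningFn) -> str:
    """Stage 1: the VLM turns the image into text. Stage 2: the LLM answers from text only."""
    description = perceive(image, GENERIC_INSTRUCTION)
    prompt = (
        f"Image description: {description}\n"
        f"Question: {question}\n"
        "Answer using only the description above."
    )
    return reason(prompt)


def accuracy(samples: List[Sample], perceive: PerceptionFn, reason: ReasoningFn) -> float:
    """Fraction of samples answered correctly (exact match, a simplifying assumption)."""
    if not samples:
        return 0.0
    correct = sum(
        prism_answer(img, q, perceive, reason).strip() == gold
        for img, q, gold in samples
    )
    return correct / len(samples)


def compare_perception(samples: List[Sample], vlms: Dict[str, PerceptionFn],
                       fixed_llm: ReasoningFn) -> Dict[str, float]:
    """Hold the LLM fixed and swap VLMs, so score differences reflect perception."""
    return {name: accuracy(samples, vlm, fixed_llm) for name, vlm in vlms.items()}


# Toy stand-ins for real models, used only to exercise the pipeline.
def sharp_vlm(image: bytes, instruction: str) -> str:
    return "a red cube on a table"


def blurry_vlm(image: bytes, instruction: str) -> str:
    return "an object on a table"


def toy_llm(prompt: str) -> str:
    return "red" if "red cube" in prompt else "unknown"


samples = [(b"", "What color is the cube?", "red")]
scores = compare_perception(samples, {"sharp": sharp_vlm, "blurry": blurry_vlm}, toy_llm)
print(scores)
```

The symmetric comparison, fixing the VLM and swapping LLMs, would isolate reasoning in the same way; it only requires iterating over the `reason` argument instead of `perceive`.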
In conclusion, the paper's main contribution is the Prism framework, which decouples perception from reasoning, makes it possible to evaluate existing VLMs on each capability separately, and provides a cost-efficient strategy for solving vision-language tasks.