Abstract: Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate how a VLM perceives images, specifically focusing on key elements of visual recognition, from primitive color and shape to semantic levels. To this end, we introduce a dataset named LENS to guide a VLM to follow the examination and check its readiness. Once the model is ready, we conduct the examination. Through this examination, we quantify and visualize VLMs' sensitivities to color and shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. Also, we found different shape sensitivity and semantic recognition depending on LLM's capacity despite using the same fixed visual encoder. Our analyses and findings have potential to inspire the design of VLMs and the pre-processing of visual input to VLMs for improving application performance.

What problem does this paper attempt to address?

This paper explores how vision language models (VLMs) perceive and recognize key elements of images, from basic colors and shapes up to the semantic level. Specifically, the authors propose an "eye examination" process to evaluate the visual capabilities of VLMs and introduce a dataset named LENS to guide the models through the examination and check their readiness.

#### Main research questions

1. **How do VLMs perceive images?**
   - Study VLMs' ability to understand colors, shapes, and semantics.
   - Explore how sensitivity in these aspects differs across VLMs.
2. **Understand the visual perception mechanism of VLMs**
   - Analyze VLMs' sensitivity to different colors, especially why they are insensitive to green.
   - Study VLMs' sensitivity to shape changes and its relationship to model size.
3. **Improve the application performance of VLMs**
   - Propose a pre-processing method based on model sensitivity to improve VLM performance on specific tasks (such as chart understanding).
4. **Explain the behavior of VLMs**
   - Design experiments that explain how VLMs behave when processing visual information, especially how their decisions are affected by the capacity of the underlying large language model (LLM).

### Method overview

To achieve these goals, the authors design a three-step eye examination process:

1. **Instruction stage**: Fine-tune VLMs on the LENS dataset so that the model knows how to take the eye examination.
2. **Preparation stage**: Use the LENS test set to evaluate whether the model is ready for the eye examination.
3. **Eye examination stage**: Evaluate the model's sensitivity to colors, shapes, and semantics through a series of questions.
Through this process, the authors quantify and visualize the sensitivity of VLMs to different visual elements, reveal differences between models, and provide insights for future model design and input pre-processing.

### Key findings

- **Color sensitivity**: VLMs are sensitive to red and blue, but not to green. This is contrary to human visual perception.
- **Shape sensitivity**: Larger VLMs are more sensitive to shape changes.
- **Semantic sensitivity**: Larger VLMs perform better at semantic recognition and are notably more accurate on background regions.

These findings help explain how VLMs work and offer a basis for improving the application performance of these models.
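To make the readiness check and the sensitivity measurement described above more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the `Judge` interface, the function names, the `max_delta` bound, and the `toy_judge` stand-in (with made-up channel weights) are all hypothetical, and the "smallest detectable shift" threshold is only one plausible way to quantify color sensitivity.

```python
from typing import Callable, Optional, Sequence, Tuple

Color = Tuple[int, int, int]
# Hypothetical interface: returns True if the model says the two colors differ.
Judge = Callable[[Color, Color], bool]


def readiness(judge: Judge, test_pairs: Sequence[Tuple[Color, Color, bool]]) -> float:
    """Fraction of labeled test pairs the judge answers correctly."""
    correct = sum(judge(a, b) == label for a, b, label in test_pairs)
    return correct / len(test_pairs)


def detection_threshold(judge: Judge, base: Color, channel: int,
                        max_delta: int = 64) -> Optional[int]:
    """Smallest upward shift on one RGB channel that the judge reports as different.

    Returns None if no shift up to max_delta is detected.
    """
    for delta in range(1, max_delta + 1):
        shifted = list(base)
        shifted[channel] = min(255, base[channel] + delta)
        if judge(base, (shifted[0], shifted[1], shifted[2])):
            return delta
    return None


def toy_judge(a: Color, b: Color) -> bool:
    """Stand-in for a VLM. The weights are invented for this demo only."""
    weights = (1.0, 0.25, 1.0)
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b)) >= 8


if __name__ == "__main__":
    test_pairs = [
        ((0, 0, 0), (0, 0, 0), False),
        ((0, 0, 0), (50, 0, 0), True),
        ((0, 0, 0), (0, 4, 0), True),
    ]
    print("readiness:", readiness(toy_judge, test_pairs))
    for name, channel in (("red", 0), ("green", 1), ("blue", 2)):
        print(name, detection_threshold(toy_judge, (100, 100, 100), channel))
```

In a real setup, `toy_judge` would be replaced by a call that renders the two colors as an image, queries the fine-tuned VLM, and parses its answer. The examination would run only once `readiness` exceeds a chosen cutoff. A larger threshold on a channel indicates lower sensitivity to that color.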

VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
