VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models

Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh
2024-09-23
Abstract: Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate how a VLM perceives images, specifically focusing on key elements of visual recognition, from primitive color and shape to semantic levels. To this end, we introduce a dataset named LENS to guide a VLM to follow the examination and check its readiness. Once the model is ready, we conduct the examination. Through this examination, we quantify and visualize VLMs' sensitivities to color, shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors, while consistently showing insensitivity to green. We also found that shape sensitivity and semantic recognition differ depending on the LLM's capacity, despite using the same fixed visual encoder. Our analyses and findings have the potential to inspire the design of VLMs and the pre-processing of visual input to VLMs for improving application performance.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper explores how Vision Language Models (VLMs) perceive and recognize key elements of images, from primitive colors and shapes up to the semantic level. Specifically, the authors propose an "eye examination" process to evaluate the visual competency of VLMs, and introduce a dataset named LENS to instruct a model to follow the examination and to check its readiness.

#### Main research questions

1. **How do VLMs perceive images?**
   - Examine how well VLMs recognize colors, shapes, and semantics.
   - Explore how sensitivity in these aspects differs across VLMs.
2. **What is the visual perception mechanism of VLMs?**
   - Analyze the VLMs' sensitivity to different colors, including their consistent insensitivity to green.
   - Study the VLMs' sensitivity to shape changes and how it relates to model size.
3. **How can the application performance of VLMs be improved?**
   - Propose a pre-processing method based on model sensitivity to improve performance on specific tasks (such as chart understanding).
4. **What explains the behavior of VLMs?**
   - Design experiments that explain how VLMs process visual information, especially how their decisions are affected by the capacity of the underlying large language model (LLM).

### Method overview

To achieve these goals, the authors designed a three-step eye examination process:

1. **Instruction stage**: Fine-tune the VLM on the LENS dataset so that the model knows how to take the eye examination.
2. **Preparation stage**: Use the LENS test set to evaluate whether the model is ready for the examination.
3. **Examination stage**: Evaluate the model's sensitivity to colors, shapes, and semantics through a series of questions.
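The three stages above can be sketched as a small control flow in which the examination only runs once the model passes a readiness check. This is a minimal illustration, not the authors' implementation: the `fine_tune` callable, the `(question, answer)` sample format, exact-match scoring, and the `0.9` readiness threshold are all assumptions introduced here, since the summary does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: the real LENS sample format is not given in the text.
Sample = Tuple[str, str]          # (question, expected answer)
Model = Callable[[str], str]      # maps a question to an answer string


@dataclass
class ExamResult:
    ready: bool           # did the model pass the readiness check?
    accuracy: float       # accuracy on the readiness test set
    answers: List[str]    # examination answers (empty if not ready)


def readiness_accuracy(model: Model, test_set: Sequence[Sample]) -> float:
    """Fraction of test questions answered exactly as expected (case-insensitive)."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(
        model(question).strip().lower() == answer.strip().lower()
        for question, answer in test_set
    )
    return correct / len(test_set)


def eye_examination(
    fine_tune: Callable[[Model, Sequence[Sample]], Model],
    base_model: Model,
    train_set: Sequence[Sample],
    test_set: Sequence[Sample],
    exam_questions: Sequence[str],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> ExamResult:
    # 1. Instruction stage: teach the model the examination format.
    model = fine_tune(base_model, train_set)
    # 2. Preparation stage: check readiness on held-out questions.
    accuracy = readiness_accuracy(model, test_set)
    if accuracy < threshold:
        return ExamResult(ready=False, accuracy=accuracy, answers=[])
    # 3. Examination stage: only a ready model is examined.
    answers = [model(question) for question in exam_questions]
    return ExamResult(ready=True, accuracy=accuracy, answers=answers)
```

Gating the examination on readiness keeps a wrong answer attributable to perception rather than to a model that never learned the question format.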
Through this process, the authors quantify and visualize the sensitivity of VLMs to different visual elements, reveal differences between models, and provide insights for future model design and input pre-processing.

### Key findings

- **Color sensitivity**: VLMs are sensitive to red and blue, but not to green. This contrasts with human visual perception.
- **Shape sensitivity**: Larger-scale VLMs are more sensitive to shape changes.
- **Semantic sensitivity**: Larger-scale VLMs perform better at semantic recognition, and are notably more accurate when processing background regions.

These findings help explain how VLMs work and offer guidance for improving the application performance of these models.
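One way to make "sensitivity to a color" concrete is a detection threshold: the smallest change to a reference color at which the model reports a difference. The sketch below is a hypothetical illustration only. The summary does not describe how the paper computes sensitivity, and the `says_different` judge, the RGB channel shifts, and the `max_delta` bound are assumptions introduced here.

```python
from typing import Callable, Optional, Tuple

RGB = Tuple[int, int, int]
# Hypothetical judge: returns True if the model says the two colors differ.
Judge = Callable[[RGB, RGB], bool]


def shift_channel(color: RGB, channel: int, delta: int) -> RGB:
    """Return a copy of `color` with one channel shifted, clamped to [0, 255]."""
    values = list(color)
    values[channel] = max(0, min(255, values[channel] + delta))
    return (values[0], values[1], values[2])


def detection_threshold(
    says_different: Judge,
    reference: RGB,
    channel: int,
    max_delta: int = 128,
) -> Optional[int]:
    """Smallest positive shift on `channel` that the judge notices.

    A smaller threshold means higher sensitivity. Returns None if no shift
    up to `max_delta` is noticed.
    """
    for delta in range(1, max_delta + 1):
        probe = shift_channel(reference, channel, delta)
        if says_different(reference, probe):
            return delta
    return None
```

With a real VLM, `says_different` would render the two colors, ask the model whether they match, and parse its answer. Comparing thresholds across the red, green, and blue channels would then show which channel the model is least sensitive to.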