Abstract: Despite the importance of shape perception in human vision, early neural image classifiers relied less on shape information for object recognition than other (often spurious) features. While recent research suggests that current large Vision-Language Models (VLMs) exhibit more reliance on shape, we find them to still be seriously limited in this regard. To quantify such limitations, we introduce IllusionBench, a dataset that challenges current cutting-edge VLMs to decipher shape information when the shape is represented by an arrangement of visual elements in a scene. Our extensive evaluations reveal that, while these shapes are easily detectable by human annotators, current VLMs struggle to recognize them, indicating important avenues for future work in developing more robust visual perception systems. The full dataset and codebase are available at: https://arshiahemmat.github.io/illusionbench/

What problem does this paper attempt to address?

This paper addresses the limited shape recognition ability of current vision-language models (VLMs). Although shape perception is central to human vision, early deep image classifiers relied more on other features, such as texture, than on shape when recognizing objects. Recent research suggests that current large-scale VLMs rely more on shape, but significant limitations remain. Specifically, the paper points out the following key issues:

1. **Limitations of existing datasets**: Previous datasets used to evaluate shape recognition (such as Cue Conflict and Stylized-ImageNet) lack coherent, natural visual scenes, lose shape information, and suffer from poor-quality style transfer.
2. **Limited shape recognition ability of VLMs**: Even the latest large-scale VLMs struggle to accurately recognize abstract shapes composed of visual elements, especially in complex natural scenes. This leaves considerable room for improvement in how these models handle shape information.

To address these problems, the authors introduce a new benchmark dataset, IllusionBench. The dataset consists of generated natural-scene images in which a shape is represented by the arrangement of visual elements in the scene, and it is used to evaluate whether VLMs can recognize shapes presented this way. The experimental results show that although human annotators can easily recognize these shapes, most VLMs perform poorly, revealing a deficiency in shape perception.

### Summary

This paper aims to reveal and quantify the limitations of current VLMs in shape recognition by introducing a new dataset and evaluation, thus providing directions for developing more robust visual perception systems.
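As a minimal sketch of how the human-versus-VLM comparison described in this section could be scored, the snippet below computes plain exact-match accuracy over shape labels. This is an assumption for illustration only: the function name `shape_accuracy`, the label strings, and the answers are all hypothetical, and the paper's actual metrics and prompts may differ.

```python
def shape_accuracy(predictions, ground_truth):
    """Fraction of images whose predicted label matches the hidden shape.

    Matching is case-insensitive exact match, an assumed scoring rule
    chosen for simplicity; it is not taken from the paper.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


# Hypothetical toy data: the shape hidden in each scene, plus answers
# from a model and from a human annotator.
truth = ["cat", "guitar", "car", "dog"]
model_answers = ["forest", "guitar", "city street", "Dog"]  # often names the scene
human_answers = ["cat", "guitar", "car", "dog"]

model_acc = shape_accuracy(model_answers, truth)
human_acc = shape_accuracy(human_answers, truth)
print(f"model: {model_acc:.2f}, human: {human_acc:.2f}")
```

In this toy setup, a model that describes the scene instead of the shape formed by its elements is scored as incorrect, which is the kind of gap the benchmark is meant to expose.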

Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

ShapeGlot: Learning Language for Shape Differentiation

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Analyzing the Roles of Language and Vision in Learning from Limited Data

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

Are VLMs Really Blind

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

Vision language models are blind

Contributions of Shape, Texture, and Color in Visual Recognition

Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions

Evaluating Multiview Object Consistency in Humans and Image Models

How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities

Teaching deep networks to see shape: Lessons from a simplified visual world

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

A Vision Check-up for Language Models