Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models

Arshia Hemmat, Adam Davies, Tom A. Lamb, Jianhao Yuan, Philip Torr, Ashkan Khakzar, Francesco Pinto
2024-11-10
Abstract: Despite the importance of shape perception in human vision, early neural image classifiers relied less on shape information for object recognition than other (often spurious) features. While recent research suggests that current large Vision-Language Models (VLMs) exhibit more reliance on shape, we find them to still be seriously limited in this regard. To quantify such limitations, we introduce IllusionBench, a dataset that challenges current cutting-edge VLMs to decipher shape information when the shape is represented by an arrangement of visual elements in a scene. Our extensive evaluations reveal that, while these shapes are easily detectable by human annotators, current VLMs struggle to recognize them, indicating important avenues for future work in developing more robust visual perception systems. The full dataset and codebase are available at: https://arshiahemmat.github.io/illusionbench/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited shape recognition ability of current vision-language models (VLMs). Although shape perception is crucial to human vision, early deep image classifiers relied more on other features, such as texture, than on shape when recognizing objects. Recent research suggests that current large VLMs rely more on shape, but the paper finds that significant limitations remain. Specifically, it highlights two key issues:

1. **Limitations of existing datasets**: Previous datasets used to evaluate shape recognition (such as Cue Conflict and Stylized-ImageNet) have problems such as a lack of coherent, natural visual scenes, loss of shape information, and poor-quality style transfer.
2. **Limited shape recognition ability of VLMs**: Even the latest large VLMs struggle to accurately recognize abstract shapes composed of visual elements, especially in complex natural scenes. These models therefore still have much room for improvement in how they handle shape information.

To address these issues, the authors introduce a new benchmark dataset, IllusionBench. It consists of generated natural-scene images in which a shape is represented by a complex arrangement of visual elements, and it is used to evaluate whether VLMs can recognize shapes presented this way. The experimental results show that although human annotators recognize these shapes easily, most VLMs perform poorly on them, which exposes a deficiency in their shape perception.

### Summary

This paper aims to reveal and quantify the limitations of current VLMs in shape recognition by introducing a new dataset and evaluation, pointing to directions for developing more robust visual perception systems.