Evaluating Vision-Language Models on Bistable Images

Artemis Panagopoulou, Coby Melkin, Chris Callison-Burch
2024-05-30
Abstract: Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the model preferences with humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from human initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors affect model interpretations significantly more than image manipulations do, showing a higher influence of the language priors on bistable image interpretations compared to image-text training data. All code and data are open sourced.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines how Vision-Language Models (VLMs) handle bistable images, that is, images that admit two distinct interpretations. The researchers collected 29 bistable images, applied 116 manipulations in brightness, tint, and rotation, and evaluated 12 models on classification and generative tasks. They found that most models, with the exception of the Idefics family and LLaVA1.5-13b, show a pronounced preference for one interpretation and vary little under image manipulations, with a few exceptions for rotation. Compared to humans, the models do not exhibit the same continuity biases and often diverge from humans' initial interpretations. Variations in prompts and the use of synonymous labels affect the models' interpretations more than image manipulations do, which indicates that language priors influence bistable image interpretation more than image-text training data does. The paper also highlights the differences between VLMs and traditional visual models in handling visual ambiguity.
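To make the two reported quantities concrete (a preference for one interpretation, and the variance of that preference across manipulated copies of an image), the following is a minimal sketch and not the authors' code. It assumes each manipulated variant yields one positive score per interpretation; the function name `preference_summary`, the score format, and the example numbers are all hypothetical.

```python
from statistics import mean, pvariance


def preference_summary(scores):
    """Summarize a model's preference between two interpretations.

    `scores` is a list of (score_a, score_b) pairs, one pair per
    manipulated variant of the same bistable image. Both scores are
    assumed to be positive (hypothetical format, not from the paper).
    """
    # Normalize each pair so the two interpretations sum to 1.
    prefs = [a / (a + b) for a, b in scores]
    avg = mean(prefs)
    if avg > 0.5:
        preferred = "A"
    elif avg < 0.5:
        preferred = "B"
    else:
        preferred = "tie"
    return {
        "mean_pref_a": avg,            # average share given to interpretation A
        "variance": pvariance(prefs),  # spread of that share across variants
        "preferred": preferred,
    }


# Made-up scores for three variants (e.g. brightness, tint, rotation).
example = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]
summary = preference_summary(example)
print(summary)
```

A mean far from 0.5 combined with a small variance would correspond to the pattern the abstract describes for most models: a strong preference for one interpretation that barely changes under image manipulations.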