Evaluating Vision-Language Models on Bistable Images

Artemis Panagopoulou, Coby Melkin, Chris Callison-Burch

2024-05-30

Abstract: Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, the models show a pronounced preference for one interpretation over the other and minimal variance under image manipulations, with a few exceptions for image rotations. Additionally, we compared the model preferences with those of humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from humans' initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors affect model interpretations significantly more than image manipulations do, which indicates that language priors influence bistable image interpretation more than image-text training data does. All code and data are open sourced.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper examines how Vision-Language Models (VLMs) handle bistable images, that is, images that admit two different interpretations. The researchers collected 29 bistable images, applied 116 manipulations in brightness, tint, and rotation, and evaluated 12 models on classification and generation tasks. They found that most models, except for the Idefics family and LLaVA1.5-13b, strongly favor one interpretation and show little sensitivity to image manipulations, with a few exceptions for rotation. Compared to humans, the models do not exhibit the same continuity bias, and they often diverge from humans' initial interpretations. Variations in prompts and the use of synonymous labels affect the models' interpretations significantly more than image manipulations do, indicating that language priors have a greater influence on bistable image interpretation than image-text training data. The paper also highlights the differences between VLMs and traditional visual models in handling visual ambiguity.
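The classification setup described above can be illustrated with a minimal sketch. It assumes each model yields one raw score per interpretation label for every manipulated version of an image; the function names, the softmax-based preference, and the example scores are hypothetical and are not taken from the paper's code or results.

```python
import math
from statistics import mean, pvariance


def preference(score_a: float, score_b: float) -> float:
    """Two-way softmax: share of preference for interpretation A, in [0, 1]."""
    m = max(score_a, score_b)  # subtract the max for numerical stability
    exp_a = math.exp(score_a - m)
    exp_b = math.exp(score_b - m)
    return exp_a / (exp_a + exp_b)


def summarize(scores: dict) -> dict:
    """Summarize preference for A across manipulations of one image.

    `scores` maps a manipulation name to a (score_a, score_b) pair and
    must contain an "original" entry for the unmodified image.
    """
    prefs = {name: preference(a, b) for name, (a, b) in scores.items()}
    base_prefers_a = prefs["original"] > 0.5
    # A "flip" is a manipulation whose preferred label differs from the original's.
    flips = [name for name, p in prefs.items() if (p > 0.5) != base_prefers_a]
    values = list(prefs.values())
    return {
        "mean_preference_a": mean(values),
        "variance": pvariance(values),
        "flips": flips,
    }


if __name__ == "__main__":
    # Made-up scores for illustration only; they are not results from the paper.
    example = {
        "original": (2.0, 0.0),
        "brighter": (2.0, 0.0),
        "rotate_180": (0.0, 1.0),
    }
    print(summarize(example))
```

A low variance together with an empty flip list would correspond to the "minimal variance under image manipulations" pattern reported for most models, while a flip on a rotated image would correspond to the exceptions noted for rotations.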

Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

A Vision Check-up for Language Models

BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Vision language models are blind

Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

Vision-Language Models under Cultural and Inclusive Considerations

Revisiting the Role of Language Priors in Vision-Language Models

Language-Based Image Editing with Recurrent Attentive Models

Learning the Visualness of Text Using Large Vision-Language Models

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

An Introduction to Vision-Language Modeling

Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?

Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!