VisMin: Visual Minimal-Change Understanding

Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal
2024-07-24
Abstract: Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to evaluate fine-grained understanding in vision-language models (VLMs): specifically, whether a model can tell two very similar images apart and match each image to its correct caption. Existing benchmarks mainly test whether a VLM can distinguish between two similar captions for a single image, and largely ignore subtle differences between images. The paper therefore proposes a new and challenging benchmark, Visual Minimal-Change Understanding (VisMin), to evaluate how well VLMs understand minimal changes in images.

### Specific Problem Description

1. **Limitations of Existing Benchmarks**:
   - In existing benchmarks such as Winoground and MMVP, an original image and its negative sample differ in multiple aspects at once (objects, attributes, background, etc.). This lowers the difficulty of the benchmark and makes it hard to measure a model's fine-grained understanding of any one aspect.
   - Some benchmarks, such as EQBEN and SPEC, have well-controlled negative samples, but their visual domain is limited to images produced by graphics engines or to simple scenes, so they lack complexity and diversity.
2. **Requirements for a New Benchmark**:
   - A benchmark is needed that accurately evaluates VLMs' understanding of subtle image changes, in particular changes in objects, attributes, counts, and spatial relationships.
   - The benchmark should be built on complex everyday-scene images so that the data remains natural and realistic.
3. **Design Goals of the VisMin Benchmark**:
   - Evaluate VLMs' understanding of subtle changes by using image pairs that differ in only one aspect (object, attribute, count, or spatial relationship).
   - Provide an automated data generation framework and ensure data quality through a rigorous four-step human verification process.
   - Generate a large-scale training dataset for fine-tuning existing VLMs to improve their fine-grained understanding.

### Solutions

1. **VisMin Benchmark**:
   - It covers four types of minimal change: object, attribute, count, and spatial relationship.
   - Within each image pair and caption pair, only one aspect changes while everything else stays the same.
   - The data is generated by an automated pipeline, and its quality is ensured through rigorous human verification.
2. **Fine-Tuning VLMs**:
   - The generated minimal-change data is used to fine-tune VLMs such as CLIP and Idefics2.
   - The fine-tuned models improve significantly in fine-grained understanding across multiple benchmarks, and the fine-tuned CLIP also improves in general image-text alignment.

### Summary

The paper aims to close a gap in existing benchmarks, which do not adequately evaluate VLMs' understanding of subtle image changes. It does so by introducing the VisMin benchmark and by improving VLM performance in this area through large-scale data generation and fine-tuning.
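
To make the two-image, two-caption matching task concrete, here is a minimal sketch of how a single sample could be scored from a 2x2 similarity matrix and then averaged per change category. The scoring convention (text, image, and group scores) is borrowed from Winoground-style evaluation and is an assumption, since the text above does not spell out the exact metric. The function names `pairwise_match_scores` and `accuracy_by_category` are hypothetical, and the similarity values would come from a model such as CLIP.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Sim = Sequence[Sequence[float]]  # sim[i][j]: similarity of image i and caption j


def pairwise_match_scores(sim: Sim) -> Dict[str, bool]:
    """Score one sample; (image 0, caption 0) and (image 1, caption 1) are the true pairs.

    Assumed Winoground-style convention, not confirmed by the source text.
    """
    # Text score: for each image, its own caption must beat the other caption.
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Image score: for each caption, its own image must beat the other image.
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Group score: both directions must be correct.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


def accuracy_by_category(samples: Iterable[Tuple[str, Sim]]) -> Dict[str, float]:
    """Average the group score per change category (object, attribute, count, spatial relation)."""
    totals = defaultdict(lambda: [0, 0])  # category -> [hits, count]
    for category, sim in samples:
        totals[category][0] += int(pairwise_match_scores(sim)["group"])
        totals[category][1] += 1
    return {category: hits / n for category, (hits, n) in totals.items()}


if __name__ == "__main__":
    # Illustrative similarity values only; not results from the paper.
    samples = [
        ("count", [[0.9, 0.2], [0.3, 0.8]]),
        ("count", [[0.5, 0.4], [0.7, 0.9]]),
        ("object", [[0.9, 0.2], [0.3, 0.8]]),
    ]
    print(accuracy_by_category(samples))
```

Reporting accuracy per category is what makes weaknesses such as counting or spatial relations visible, rather than hiding them in a single aggregate number.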