Abstract: Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.

What problem does this paper attempt to address?

This paper addresses how to evaluate the fine-grained understanding of vision-language models (VLMs), in particular whether a model can distinguish between two very similar images and match each image to its correct caption. Existing benchmarks mainly evaluate a VLM's ability to distinguish between two similar captions and neglect subtle changes in images. The paper therefore proposes a new, challenging benchmark, Visual Minimal-Change Understanding (VisMin), to evaluate how well VLMs understand such changes.

### Specific Problem Description

1. **Limitations of Existing Benchmarks**:
   - In existing benchmarks such as Winoground and MMVP, the original images and their negative samples differ in multiple aspects (objects, attributes, backgrounds, etc.). This limits the difficulty of the benchmarks and makes it hard to evaluate a model's fine-grained understanding of any one specific aspect.
   - Some benchmarks, such as EQBEN and SPEC, have well-controlled negative samples, but their visual domain is limited to images generated by graphics engines or to simple scenes, so they lack complexity and diversity.
2. **Requirements for a New Benchmark**:
   - A benchmark is needed that accurately evaluates a VLM's ability to understand subtle changes in images, especially changes in objects, attributes, counts, and spatial relationships.
   - The new benchmark should be built on complex everyday-scene images to ensure naturalness and authenticity.
3. **Design Goals of the VisMin Benchmark**:
   - Evaluate a VLM's understanding of subtle changes by introducing image pairs that differ in only one aspect (object, attribute, count, or spatial relationship).
   - Provide an automated data generation framework and ensure data quality through a strict four-step human verification process.
   - Generate a large-scale training dataset for fine-tuning existing VLMs to improve their fine-grained understanding.

### Solutions

1. **VisMin Benchmark**:
   - It covers four types of minimal change: object, attribute, count, and spatial relationship.
   - Only one aspect changes between the two images and the two captions of each example; everything else stays the same.
   - The data is generated at scale through an automated pipeline, and its quality is ensured through strict human verification.
2. **Fine-Tuning VLMs**:
   - The generated minimal-change dataset is used to fine-tune VLMs such as CLIP and Idefics2.
   - The experimental results show that the fine-tuned models improve significantly in fine-grained understanding across multiple benchmarks and, for CLIP, in general image-text alignment.

### Summary

This paper aims to fill a gap in existing benchmarks, which do not evaluate how well VLMs understand subtle changes in images, by introducing the VisMin benchmark, and to improve VLM performance in this area through large-scale data generation and fine-tuning.
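
The matching task described above (two images, two captions, one correct pairing) can be illustrated with a small scoring sketch. The summary does not say how predictions are scored, so this sketch assumes the Winoground-style text, image, and group scores computed from a 2x2 similarity matrix. The function names `score_pair` and `accuracy` are hypothetical, and the similarity values would come from whatever model is being evaluated (for example, CLIP image-text similarities).

```python
from typing import Dict, List, Sequence

# Assumption: sim[c][i] is the model's similarity between caption c and image i,
# and the correct pairings are (caption 0, image 0) and (caption 1, image 1).
Sim = Sequence[Sequence[float]]


def score_pair(sim: Sim) -> Dict[str, bool]:
    """Winoground-style scores for one example (assumed metric, not confirmed by the text)."""
    # Text score: for each image, the correct caption must score higher than the other caption.
    text_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Image score: for each caption, the correct image must score higher than the other image.
    image_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Group score: both directions must be correct.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


def accuracy(sims: List[Sim]) -> Dict[str, float]:
    """Fraction of examples passing each score."""
    results = [score_pair(s) for s in sims]
    return {
        key: sum(r[key] for r in results) / len(results)
        for key in ("text", "image", "group")
    }


if __name__ == "__main__":
    examples = [
        [[0.9, 0.2], [0.3, 0.8]],  # both directions correct
        [[0.6, 0.7], [0.5, 0.9]],  # captions ranked correctly, images not
    ]
    print(accuracy(examples))
```

Because the two images differ in only one aspect, a model that ignores that aspect produces nearly identical similarities for both pairings, so the strict inequalities act as a direct check on whether the change was perceived.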

VisMin: Visual Minimal-Change Understanding

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Probing Conceptual Understanding of Large Visual-Language Models

Describing Differences in Image Sets with Natural Language

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation

Visually-Augmented Language Modeling

Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions