Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

2024-03-30

Abstract: Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the difficulty that vision-language models (VLMs) have with fine-grained visual-linguistic concepts, such as object attributes and inter-object relationships. Existing VLMs perform well on many downstream tasks, yet their grasp of size, position, existence, and count remains weak. Existing fine-grained benchmarks mostly vary the text and neglect the visual side, so the paper argues for evaluating from both perspectives. To diagnose these weaknesses, the authors build a progressive data generation pipeline that synthesizes images differing in one specific attribute while all other aspects stay consistent. Using this data engine, they design a benchmark, SPEC, that evaluates a model's understanding of object size, position, existence, and count. An evaluation of four leading VLMs on SPEC shows performance close to random guessing, which exposes significant limitations. To improve fine-grained understanding, the paper introduces a simple method that incorporates challenging negative samples during training, encouraging the model to distinguish candidates that differ only subtly. This yields significant improvements on SPEC without compromising zero-shot performance, and consistent gains on two additional fine-grained benchmarks support the transferability of the approach. In summary, the paper contributes both a new evaluation tool that reveals the current limitations of VLMs and an optimization strategy that alleviates them.
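The idea of training with challenging negative samples can be illustrated with a minimal sketch. It assumes a CLIP-style image-to-text contrastive loss in which each image's candidate set is the in-batch captions plus extra hard-negative captions for that image (for example, a caption with the wrong count or position). The function names, the temperature value, and the toy embeddings are all hypothetical, and the loss form is an assumption for illustration rather than the authors' exact objective.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def image_to_text_loss(image_embs, text_embs, hard_neg_text_embs, temperature=0.07):
    """Mean cross-entropy over images.

    For image i the candidates are all in-batch captions followed by that
    image's own hard-negative captions; the correct caption sits at index i.
    This is an assumed CLIP-style formulation, not the paper's exact loss.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        candidates = texts + [normalize(v) for v in hard_neg_text_embs[i]]
        logits = [dot(img, t) / temperature for t in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(images)

# Toy 2-D embeddings (hypothetical values, for illustration only).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]
# One hard-negative caption per image, deliberately close to the positive.
hard_negs = [[[0.8, 0.2]], [[0.2, 0.8]]]
no_hard_negs = [[], []]

baseline_loss = image_to_text_loss(image_embs, text_embs, no_hard_negs)
hard_negative_loss = image_to_text_loss(image_embs, text_embs, hard_negs)
```

Because every hard negative adds a term to the softmax denominator, the loss with hard negatives is never lower than the baseline for the same embeddings. That extra penalty is what pushes the model to separate captions that differ only in one attribute.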

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

POINTS: Improving Your Vision-language Model with Affordable Strategies

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

MMBench: Is Your Multi-modal Model an All-around Player?

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

DeepSeek-VL: Towards Real-World Vision-Language Understanding

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?