Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu
2024-03-30
Abstract: Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty that vision-language models (VLMs) have with fine-grained visual-linguistic concepts, such as object attributes and inter-object relationships. Although existing VLMs perform well on many downstream tasks, they still struggle with object size, position, existence, and count. The paper proposes a progressive pipeline that synthesizes images differing in one specific attribute while keeping all other aspects consistent. Using this data engine, the authors build a benchmark, SPEC, that evaluates comprehension of object size, position, existence, and count. Their evaluation of four leading VLMs shows performance close to random guessing, revealing significant limitations. To improve fine-grained understanding, the paper introduces a simple yet effective method that incorporates hard negative samples during training, encouraging the model to distinguish candidates that differ only subtly. This significantly improves performance on SPEC without compromising zero-shot performance, and consistent gains on two additional fine-grained benchmarks support the transferability of the approach. In summary, the paper contributes a diagnostic benchmark that exposes the limitations of current VLMs in fine-grained understanding and an optimization strategy that mitigates them.
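The idea of training with hard negatives can be illustrated with a minimal sketch. It is not the authors' implementation: the text above says only that challenging negative samples are incorporated during training, so the loss form (a softmax cross-entropy over one matching caption and several hard-negative captions), the function name `hard_negative_loss`, and the temperature value are all assumptions chosen for illustration. Embeddings are plain Python lists so the example runs without any deep learning library.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def hard_negative_loss(image_emb, positive_text_emb, negative_text_embs,
                       temperature=0.07):
    """Hypothetical loss: cross-entropy over one matching caption and
    several hard-negative captions for a single image.

    The matching caption is placed at index 0 of the candidate list, so the
    loss is -log softmax(logits)[0]. The temperature is an assumed value.
    """
    img = normalize(image_emb)
    candidates = [normalize(positive_text_emb)]
    candidates += [normalize(t) for t in negative_text_embs]
    logits = [dot(img, t) / temperature for t in candidates]

    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[0]


# Toy example: the image embedding is close to the matching caption and far
# from a caption that differs in one attribute (for instance, "left" vs "right").
image = [1.0, 0.0]
matching_caption = [1.0, 0.1]
hard_negative_caption = [0.0, 1.0]
loss = hard_negative_loss(image, matching_caption, [hard_negative_caption])
print(f"loss with a well-separated hard negative: {loss:.6f}")
```

When the model cannot tell the candidates apart, all logits are equal and the loss equals the log of the number of candidates, which corresponds to the random-guess behavior reported on SPEC. Minimizing such a loss pushes the matching caption's similarity above that of the near-identical negatives.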