Abstract: Large Vision-Language Models (LVLMs) have become essential for integrating visual and linguistic information, enabling a wide range of complex applications and tasks. However, evaluating LVLMs remains challenging: benchmarks typically require substantial human effort to construct and, once built, are static and inflexible. Although automatic evaluation has been explored for the textual modality, the visual modality remains under-explored. In this work, we therefore ask: "Can LVLMs serve as a path to automatic benchmarking?" We introduce AutoBench-V, an automated framework for on-demand evaluation, i.e., benchmarking LVLMs on specific aspects of model capability. Given an evaluation capability, AutoBench-V uses text-to-image models to generate relevant image samples and then uses LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. An extensive evaluation of seven popular LVLMs across five user-specified evaluation capabilities shows that the framework is effective and reliable. We observe the following: (1) the constructed benchmark accurately reflects varying task difficulties; (2) as task difficulty rises, the performance gap between models widens; (3) while models perform strongly on abstract-level understanding, they underperform on detail-level reasoning tasks; and (4) constructing a dataset with varying levels of difficulty is critical for a comprehensive and exhaustive evaluation. Overall, AutoBench-V not only successfully uses LVLMs for automated benchmarking but also shows that LVLMs as judges have significant potential in various domains.

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Unified Vision-Language Pre-Training for Image Captioning and VQA

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

VindLU: A Recipe for Effective Video-and-Language Pretraining

A Survey of Vision-Language Pre-Trained Models

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends

LOVM: Language-Only Vision Model Selection

Playing Lottery Tickets with Vision and Language

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Scalable Performance Analysis for Vision-Language Models

Leveraging per Image-Token Consistency for Vision-Language Pre-training

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

VLP: A Survey on Vision-language Pre-training