Abstract: Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, evaluating LVLMs presents significant challenges: evaluation benchmarks demand substantial human effort to construct and remain static, lacking flexibility once built. Although automatic evaluation has been explored in the textual modality, the visual modality remains under-explored. In this work, we therefore ask: "Can LVLMs serve as a path to automatic benchmarking?" We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Upon receiving an evaluation capability, AutoBench-V leverages text-to-image models to generate relevant image samples and then utilizes LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of seven popular LVLMs across five user-requested evaluation capabilities, the framework shows effectiveness and reliability. We observe the following: (1) our constructed benchmark accurately reflects varying task difficulties; (2) as task difficulty rises, the performance gap between models widens; (3) while models perform strongly in abstract-level understanding, they underperform in detail-oriented reasoning tasks; and (4) constructing a dataset with varying levels of difficulty is critical for a comprehensive and exhaustive evaluation. Overall, AutoBench-V not only successfully utilizes LVLMs for automated benchmarking but also reveals that LVLMs as judges have significant potential in various domains.

What problem does this paper attempt to address?

The problem this paper addresses is how to automate the evaluation of Large Vision-Language Models (LVLMs). Existing LVLM evaluation benchmarks have the following issues:

1. **High Cost**: Constructing evaluation benchmarks requires a significant amount of human labor.
2. **Static Nature**: Once constructed, evaluation benchmarks lack flexibility and are difficult to adjust to new requirements.
3. **Insufficient Automated Evaluation of Visual Modality**: Although some automated evaluation methods exist for the text modality, automated evaluation in the visual modality is still under-explored.

To address these issues, the paper proposes an automated framework named AutoBench-V, which generates evaluation tasks from user requirements to assess the capabilities of LVLMs efficiently and flexibly. AutoBench-V does so through the following steps:

1. **User Requirement Processing**: Receives the user's input and determines the specific capabilities that need to be evaluated.
2. **Hierarchical Aspect Generation**: Decomposes the user requirement into multiple high-level and fine-grained evaluation aspects.
3. **Image Description Generation**: Generates image descriptions at different difficulty levels based on the generated evaluation aspects.
4. **Image Generation and Self-Verification**: Uses text-to-image models to generate the corresponding images and verifies the consistency between images and descriptions through visual question answering (VQA) tasks.
5. **Question Generation and Evaluation**: Generates evaluation questions and their reference answers, presents the questions to the LVLMs under evaluation, and assesses their performance.

Through these steps, AutoBench-V automates the generation of evaluation tasks and reduces human intervention, improving the efficiency and objectivity of the evaluation. The paper validates the effectiveness and reliability of the framework through extensive experiments, revealing the performance characteristics of LVLMs on different tasks and providing insights for further research.
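
The five steps summarized above can be sketched as a small pipeline. The following Python sketch is illustrative only and is not the paper's implementation: every function and class name is hypothetical, the model calls are replaced by deterministic toy stand-ins, and the difficulty labels (`easy`, `medium`, `hard`), the retry-on-failed-verification loop, and exact-match scoring are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass
class Sample:
    """One generated benchmark item (hypothetical structure)."""
    aspect: str
    difficulty: str
    description: str
    image: str  # stand-in for real image data
    question: str
    reference: str


def build_benchmark(
    capability: str,                                        # step 1: user requirement
    generate_aspects: Callable[[str], List[str]],
    describe_image: Callable[[str, str], str],
    text_to_image: Callable[[str], str],
    verify_image: Callable[[str, str], bool],
    make_question: Callable[[str, str], Tuple[str, str]],
    difficulties: Sequence[str] = ("easy", "medium", "hard"),  # assumed labels
    max_attempts: int = 3,                                  # assumed retry budget
) -> List[Sample]:
    samples: List[Sample] = []
    for aspect in generate_aspects(capability):             # step 2: aspects
        for level in difficulties:
            description = describe_image(aspect, level)     # step 3: description
            image: Optional[str] = None
            for _ in range(max_attempts):                   # step 4: generate + verify
                candidate = text_to_image(description)
                if verify_image(candidate, description):
                    image = candidate
                    break
            if image is None:
                continue  # assumption: drop items that never pass verification
            question, reference = make_question(description, aspect)  # step 5
            samples.append(
                Sample(aspect, level, description, image, question, reference)
            )
    return samples


def evaluate(samples: List[Sample], model: Callable[[str, str], str]) -> Dict[str, float]:
    """Accuracy per difficulty level, using exact-match scoring (an assumption)."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for s in samples:
        total[s.difficulty] = total.get(s.difficulty, 0) + 1
        answer = model(s.image, s.question)
        if answer.strip().lower() == s.reference.strip().lower():
            correct[s.difficulty] = correct.get(s.difficulty, 0) + 1
    return {level: correct.get(level, 0) / n for level, n in total.items()}


# Toy stand-ins for the LVLM and text-to-image calls (all hypothetical).
def toy_aspects(capability: str) -> List[str]:
    return [f"{capability}: counting", f"{capability}: relations"]

def toy_describe(aspect: str, level: str) -> str:
    return f"[{level}] scene for {aspect}"

def toy_text_to_image(description: str) -> str:
    return f"image<{description}>"

def toy_verify(image: str, description: str) -> bool:
    return description in image

def toy_question(description: str, aspect: str) -> Tuple[str, str]:
    return f"What does the image show? ({aspect})", description

def toy_model(image: str, question: str) -> str:
    # Toy behaviour only: fails on "hard" items so the report is not uniform.
    if "[hard]" in image:
        return "unsure"
    return image[len("image<"):-1]


samples = build_benchmark(
    "spatial understanding",
    toy_aspects, toy_describe, toy_text_to_image, toy_verify, toy_question,
)
report = evaluate(samples, toy_model)
print(report)  # {'easy': 1.0, 'medium': 1.0, 'hard': 0.0}
```

Passing each stage in as a callable keeps the sketch independent of any particular model API: a real setup would swap the toy functions for calls to an LVLM and a text-to-image model, and would likely replace exact-match scoring with a more tolerant judge.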

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology

AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

MMBench: Is Your Multi-modal Model an All-around Player?

DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition

Effectiveness Assessment of Recent Large Vision-Language Models