AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Han Bao, Yue Huang, Yanbo Wang, Jiayi Ye, Xiangqi Wang, Xiuying Chen, Mohamed Elhoseiny, Xiangliang Zhang
2024-10-29
Abstract: Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, the evaluation of LVLMs presents significant challenges as the evaluation benchmark always demands lots of human cost for its construction, and remains static, lacking flexibility once constructed. Even though automatic evaluation has been explored in textual modality, the visual modality remains under-explored. As a result, in this work, we address a question: "Can LVLMs serve as a path to automatic benchmarking?". We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Upon receiving an evaluation capability, AutoBench-V leverages text-to-image models to generate relevant image samples and then utilizes LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of seven popular LVLMs across five demanded user inputs (i.e., evaluation capabilities), the framework shows effectiveness and reliability. We observe the following: (1) Our constructed benchmark accurately reflects varying task difficulties; (2) As task difficulty rises, the performance gap between models widens; (3) While models exhibit strong performance in abstract level understanding, they underperform in details reasoning tasks; and (4) Constructing a dataset with varying levels of difficulties is critical for a comprehensive and exhaustive evaluation. Overall, AutoBench-V not only successfully utilizes LVLMs for automated benchmarking but also reveals that LVLMs as judges have significant potential in various domains.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to address is how to automate the evaluation of large vision-language models (LVLMs). Existing LVLM evaluation benchmarks have three issues:

1. **High Cost**: Constructing an evaluation benchmark requires a significant amount of human labor.
2. **Static Nature**: Once constructed, a benchmark lacks flexibility and is difficult to adjust to new requirements.
3. **Insufficient Automated Evaluation of Visual Modality**: Automated evaluation methods exist for the text modality, but automated evaluation in the visual modality remains under-explored.

To address these issues, the paper proposes AutoBench-V, an automated framework that generates evaluation tasks on demand from a user-specified capability, so that LVLMs can be assessed efficiently and flexibly. The framework works in five steps:

1. **User Requirement Processing**: Receives the user's input and determines the specific capability to be evaluated.
2. **Hierarchical Aspect Generation**: Decomposes the requested capability into high-level aspects and, within each, fine-grained evaluation aspects.
3. **Image Description Generation**: Generates image descriptions at different difficulty levels for each evaluation aspect.
4. **Image Generation and Self-Verification**: Uses a text-to-image model to generate an image for each description, then checks that the image is consistent with its description through visual question answering (VQA).
5. **Question Generation and Evaluation**: Generates evaluation questions with reference answers, presents the questions to the LVLMs under evaluation, and scores their responses.

Through these steps, AutoBench-V automates the generation of evaluation tasks and reduces human intervention, improving the efficiency and objectivity of the evaluation.
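The five steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Examiner` interface, `build_benchmark`, `evaluate`, and the toy stand-ins are all hypothetical names, images are represented as plain strings, and answers are scored by case-insensitive exact match, which the paper may well not use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Sequence, Tuple


@dataclass
class Sample:
    aspect: str
    difficulty: str
    description: str
    image: str       # stand-in for real image data (assumption)
    question: str
    reference: str


class Examiner(Protocol):
    """Hypothetical interface for the LVLM that builds the benchmark."""
    def generate_aspects(self, capability: str) -> List[str]: ...
    def describe_image(self, aspect: str, difficulty: str) -> str: ...
    def verify(self, image: str, description: str) -> bool: ...
    def make_question(self, image: str, description: str) -> Tuple[str, str]: ...


def build_benchmark(
    capability: str,                       # step 1: the user's requested capability
    examiner: Examiner,
    text_to_image: Callable[[str], str],
    difficulties: Sequence[str] = ("easy", "medium", "hard"),
) -> List[Sample]:
    samples: List[Sample] = []
    for aspect in examiner.generate_aspects(capability):               # step 2
        for difficulty in difficulties:
            description = examiner.describe_image(aspect, difficulty)  # step 3
            image = text_to_image(description)                         # step 4: generate
            if not examiner.verify(image, description):                # step 4: self-verify
                continue  # drop images that do not match their description
            question, reference = examiner.make_question(image, description)  # step 5
            samples.append(
                Sample(aspect, difficulty, description, image, question, reference)
            )
    return samples


def evaluate(model: Callable[[str, str], str], samples: List[Sample]) -> Dict[str, float]:
    """Step 5: accuracy per difficulty level (exact match is an assumption)."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for s in samples:
        totals[s.difficulty] = totals.get(s.difficulty, 0) + 1
        answer = model(s.image, s.question)
        if answer.strip().lower() == s.reference.strip().lower():
            correct[s.difficulty] = correct.get(s.difficulty, 0) + 1
    return {d: correct.get(d, 0) / n for d, n in totals.items()}


# Toy stand-ins so the sketch runs without any real model.
class ToyExaminer:
    def generate_aspects(self, capability: str) -> List[str]:
        return [f"{capability}: objects", f"{capability}: relations"]

    def describe_image(self, aspect: str, difficulty: str) -> str:
        return f"[{difficulty}] scene for {aspect}"

    def verify(self, image: str, description: str) -> bool:
        return description in image

    def make_question(self, image: str, description: str) -> Tuple[str, str]:
        return "What does the image show?", description


def toy_text_to_image(description: str) -> str:
    return f"<image of {description}>"


def toy_model(image: str, question: str) -> str:
    # "Reads" the description back out of the fake image string.
    return image[len("<image of "):-1]
```

Calling `build_benchmark("spatial understanding", ToyExaminer(), toy_text_to_image)` yields one sample per aspect and difficulty level, and `evaluate(toy_model, samples)` returns accuracy keyed by difficulty. In a real setup the examiner, the text-to-image function, and the model under test would each wrap an actual model call.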
The paper validates the effectiveness and reliability of the framework by evaluating seven popular LVLMs across five user-specified evaluation capabilities. The results show that the performance gap between models widens as task difficulty rises, and that models perform better at abstract-level understanding than at detail reasoning.