A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai
2024-06-05
Abstract: How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high costs and extensive time commitments required for user studies, many researchers have turned towards employing large multi-modal models (LMMs) as AIGI evaluators, the precision and validity of which are still questionable. Furthermore, traditional benchmarks often utilize mostly natural-captured content rather than AIGIs to test the abilities of LMMs, leading to a noticeable gap for AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) Emphasizing both high-level semantic understanding and low-level visual quality perception to address the intricate demands of AIGIs. 2) Various generative models are utilized for AIGI creation, and various LMMs are employed for evaluation, which ensures a comprehensive validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs. We hope that A-Bench will significantly enhance the evaluation process and promote the generation quality for AIGIs. The benchmark is available at https://github.com/Q-Future/A-Bench.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses how to evaluate AI-generated images (AIGIs) accurately and efficiently. Specifically, it examines how reliable and effective large multimodal models (LMMs) are as AIGI evaluators. Many researchers already use LMMs to assess AIGIs, yet the accuracy and validity of these models in that role remain questionable. In addition, traditional benchmarks mostly test LMMs on naturally captured content rather than AIGIs, which leaves a significant gap in the evaluation of AIGIs.

### Main Contributions

1. **Constructing the A-Bench Benchmark**: The benchmark includes 2,864 AIGIs sampled from 16 text-to-image models, each accompanied by a question-answer set annotated by human experts, covering both high-level semantic understanding and low-level quality perception.
2. **Detailed Exploration of "Diagnostic" Content**: Semantic understanding is subdivided into basic recognition, bag-of-words trap differentiation, and external knowledge realization; quality perception is subdivided into technical quality perception, aesthetic quality assessment, and generative distortion evaluation.
3. **Insights from Benchmark Results**: The benchmark results, obtained from 18 leading LMMs, can diagnose the weaknesses of individual LMMs in AIGI evaluation and help improve them.

### Main Findings

1. **Human Performance Surpasses All LMMs**: Even the worst human performance exceeds that of every LMM, indicating that LMMs are still far inferior to humans in AIGI evaluation.
2. **Performance Differences Between Open-Source and Closed-Source LMMs**: Closed-source LMMs generally perform better than open-source LMMs.
3. **LMMs' Shortcomings in Complex Tasks**: LMMs perform well in basic recognition tasks but poorly in tasks requiring deeper semantic understanding and reasoning (such as bag-of-words trap differentiation, compositional recognition, and object counting).

### Conclusion

LMMs cannot currently be considered experts in evaluating AIGIs. Although they perform well in some basic tasks, a significant gap remains in complex semantic understanding and quality perception tasks, which calls for further research and improvement.
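
### Illustrative Sketch: Per-Category Scoring

The diagnostic idea summarized above (expert-annotated question-answer pairs grouped into semantic and quality subcategories) can be illustrated with a minimal Python sketch that scores a model's answers per subcategory. The record layout (`id`, `category`, `answer`), the sample items, and the function name `per_category_accuracy` are assumptions made for illustration only; they are not taken from the A-Bench repository, whose actual data format and scoring protocol may differ.

```python
from collections import defaultdict

# Hypothetical records: one multiple-choice question per item.
# The real A-Bench annotation files may be organized differently.
SAMPLE_ITEMS = [
    {"id": "q1", "category": "basic recognition", "answer": "A"},
    {"id": "q2", "category": "basic recognition", "answer": "B"},
    {"id": "q3", "category": "technical quality perception", "answer": "C"},
    {"id": "q4", "category": "generative distortion evaluation", "answer": "A"},
]


def per_category_accuracy(items, predictions):
    """Return {category: accuracy} for a mapping of question id -> chosen option."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        category = item["category"]
        total[category] += 1
        # A missing prediction is counted as a wrong answer.
        if predictions.get(item["id"]) == item["answer"]:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


if __name__ == "__main__":
    # Hypothetical model output; q4 is left unanswered on purpose.
    model_predictions = {"q1": "A", "q2": "C", "q3": "C"}
    for category, acc in per_category_accuracy(SAMPLE_ITEMS, model_predictions).items():
        print(f"{category}: {acc:.2f}")
```

Reporting accuracy per subcategory rather than as one overall number is what makes this kind of benchmark diagnostic: a model can score well on basic recognition while still failing on harder subcategories.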