Abstract: How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high costs and extensive time commitments required for user studies, many researchers have turned towards employing large multi-modal models (LMMs) as AIGI evaluators, the precision and validity of which are still questionable. Furthermore, traditional benchmarks often utilize mostly natural-captured content rather than AIGIs to test the abilities of LMMs, leading to a noticeable gap for AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) Emphasizing both high-level semantic understanding and low-level visual quality perception to address the intricate demands of AIGIs. 2) Various generative models are utilized for AIGI creation, and various LMMs are employed for evaluation, which ensures a comprehensive validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs. We hope that A-Bench will significantly enhance the evaluation process and promote the generation quality for AIGIs. The benchmark is available at https://github.com/Q-Future/A-Bench.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses how to accurately and efficiently evaluate AI-generated images (AIGIs). Specifically, it focuses on the reliability and effectiveness of large multimodal models (LMMs) in evaluating AIGIs. Although many researchers have started using LMMs to assess AIGIs, the accuracy and effectiveness of these models remain questionable. Additionally, traditional benchmarks typically use naturally captured content rather than AIGIs to test the capabilities of LMMs, leading to a significant gap in the evaluation of AIGIs.

### Main Contributions

1. **Constructing the A-Bench Benchmark**: This benchmark includes 2,864 AIGIs (from various text-to-image models), each accompanied by a question-answer set annotated by human experts, covering both high-level semantic understanding and low-level quality perception.
2. **Detailed Exploration of "Diagnostic" Content**: Semantic understanding is subdivided into basic recognition, bag-of-words trap differentiation, and external knowledge realization; quality perception is subdivided into technical quality perception, aesthetic quality assessment, and generative distortion evaluation.
3. **Insights from Benchmark Results**: The benchmark results can diagnose various issues of different LMMs in AIGI evaluation and assist in their improvement.

### Main Findings

1. **Human Performance Surpasses All LMMs**: Even the worst human performance exceeds that of all LMMs, indicating that LMMs are still far inferior to humans in AIGI evaluation.
2. **Performance Differences Between Open-Source and Closed-Source LMMs**: Closed-source LMMs generally perform better than open-source LMMs.
3. **LMMs' Shortcomings in Complex Tasks**: LMMs perform well in basic recognition tasks but poorly in tasks requiring deeper semantic understanding and reasoning (such as bag-of-words trap differentiation, compositional recognition, and object counting).

### Conclusion

LMMs cannot currently be considered experts in evaluating AIGIs. Although they perform well in some basic tasks, there is still a significant gap in complex semantic understanding and quality perception tasks, necessitating further research and improvement.
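### Illustrative Sketch: Per-Dimension Accuracy

The summary above describes question-answer pairs grouped into diagnostic subcategories, so one plausible way to compare an LMM against the human annotations is to report accuracy per subcategory. The snippet below is a minimal sketch under stated assumptions, not the authors' evaluation code: it assumes each question has a single correct option letter, and the record keys `dimension`, `answer`, and `prediction` as well as the function name `accuracy_by_dimension` are hypothetical.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for an iterable of answer records.

    Each record is a dict with hypothetical keys:
      'dimension'  - subcategory name, e.g. "basic recognition"
      'answer'     - ground-truth option letter from the annotators
      'prediction' - option letter produced by the model under test
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dim = record["dimension"]
        total[dim] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy records for illustration only; these are not A-Bench data.
records = [
    {"dimension": "basic recognition", "answer": "A", "prediction": "A"},
    {"dimension": "basic recognition", "answer": "B", "prediction": "b "},
    {"dimension": "bag-of-words", "answer": "C", "prediction": "A"},
    {"dimension": "bag-of-words", "answer": "D", "prediction": "D"},
]

scores = accuracy_by_dimension(records)
for dim, acc in scores.items():
    print(f"{dim}: {acc:.2f}")
```

Reporting one score per subcategory, rather than a single overall number, matches the diagnostic framing in the summary: a model can look strong on basic recognition while a weakness on harder subcategories stays visible.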

A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
