Abstract: Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.

What problem does this paper attempt to address?

This paper addresses the problem of how to evaluate text-to-image generation models effectively. Because such evaluation inherently depends on subjective judgment and human preference, it is difficult to compare models and to quantify the current state of the art. The main challenges are:

1. **Subjectivity**: Assessing image quality usually requires subjective judgment, which complicates benchmarking across models.
2. **Lack of standardized benchmarks**: Unlike other areas of computer vision, which have standardized benchmarks such as ImageNet or COCO, text-to-image generation lacks a widely accepted benchmark.
3. **Limitations of existing methods**: Existing evaluation methods rely either on a limited user panel or on AI models trained to predict human preferences, and both struggle to capture the full range of human preferences and cultural backgrounds.

To address these problems, the authors propose a new paradigm that uses Rapidata's technology to reach a large number of annotators worldwide and to collect large-scale human feedback efficiently and at low cost. The specific contributions are:

- Introducing a new annotation process for collecting human preferences at scale, which gathers a large amount of feedback quickly and at low cost.
- Compiling a carefully curated set of 282 image generation prompts covering a wide range of evaluation criteria.
- Ranking four major text-to-image generation models (Flux.1, DALL-E 3, Stable Diffusion, and MidJourney) based on more than 2 million human votes.

The authors present this as the most comprehensive benchmark for text-to-image generation models so far. They also report that the annotator pool is diverse and reflects the distribution of the global population, which significantly reduces the risk of bias.
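To make the idea of ranking models from human votes concrete, here is a minimal Python sketch that turns pairwise preference votes into a ranking by win rate. This is only an illustration under stated assumptions: the summary above does not say how the paper aggregates its votes, so plain win rate is a stand-in technique, and the function name `rank_by_win_rate`, the `(winner, loser)` vote format, and the model names in the toy data are all hypothetical.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by the fraction of pairwise comparisons they won.

    `votes` is an iterable of (winner, loser) model-name pairs.
    Assumption: simple win rate is used here for illustration only;
    it is not claimed to be the paper's aggregation method.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    # Every model that appears in a vote has at least one comparison.
    rates = {model: wins[model] / comparisons[model] for model in comparisons}
    # Highest win rate first.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy votes, not data from the paper.
toy_votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
    ("model_c", "model_b"),
]

ranking = rank_by_win_rate(toy_votes)
for model, rate in ranking:
    print(f"{model}: {rate:.2f}")
```

Win rate is easy to read but ignores opponent strength. A pairwise-comparison model such as Bradley-Terry or an Elo-style rating would be a natural alternative when models are not compared equally often.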

Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation