Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation

Dimitrios Christodoulou, Mads Kuhlmann-Jørgensen
2024-10-15
Abstract: Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to evaluate text-to-image generation models effectively. Because this kind of evaluation inherently depends on subjective judgment and human preference, it is difficult to compare models and to quantify the current state of the art. The main challenges are:

1. **Subjectivity**: Assessing image quality usually requires subjective judgment, which complicates benchmarking across models.
2. **Lack of standardized benchmarks**: Other areas of computer vision have standard benchmarks such as ImageNet or COCO, but text-to-image generation has no widely accepted equivalent.
3. **Limitations of existing methods**: Existing evaluation methods either rely on a limited user panel or use AI models trained to predict human preferences. Both approaches struggle to capture the full range of human preferences and cultural backgrounds.

To address these problems, the authors propose using Rapidata's technology to reach a large, worldwide pool of annotators and collect large-scale human feedback efficiently and at low cost. The specific contributions are:

- Introducing a new annotation process for collecting human preferences at scale, which gathers a large volume of feedback quickly and cheaply.
- Compiling a curated set of 282 image generation prompts that cover a wide range of evaluation criteria.
- Ranking four major text-to-image generation models (Flux.1, DALL-E 3, Stable Diffusion, and MidJourney) based on more than 2 million human votes.

According to the authors, this yields the most comprehensive benchmark for text-to-image generation models to date. They also report that the annotator pool is diverse and reflects the distribution of the global population, which significantly reduces the risk of bias.
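
The summary says the models are ranked from human votes but does not describe how the votes are aggregated. As an illustration only, the sketch below assumes each vote is a pairwise choice between two models and ranks models by simple win rate. The function name `rank_by_win_rate`, the `(winner, loser)` vote format, and the placeholder model names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by the share of pairwise comparisons they won.

    votes: iterable of (winner, loser) model-name pairs, one per human vote.
    Returns a list of (model, win_rate) tuples, highest win rate first.
    Assumption: simple win rate, not necessarily the paper's method.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    # Win rate = comparisons won / comparisons the model took part in.
    rates = {model: wins[model] / appearances[model] for model in appearances}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up votes for illustration only; these are not results from the paper.
example_votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]
ranking = rank_by_win_rate(example_votes)
```

In a setting like the one described, the same aggregation would presumably be run once per criterion (style preference, coherence, text-to-image alignment), giving one ranking for each.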