Abstract: While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.
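
The abstract describes VQAScore only in words, as the likelihood that a VQA model views an image as accurately depicting the prompt. One way to write that down is sketched here. The symbols $i$ (image), $t$ (text prompt), and $q(t)$ (a yes/no question built from the prompt) are notation introduced for illustration, and the exact wording of the question template is an assumption rather than something stated in the abstract.

$$
\mathrm{VQAScore}(i, t) \;=\; P\big(\text{``Yes''} \mid i,\; q(t)\big)
$$

Under this reading, the score lies in $[0, 1]$, and a higher value means the VQA model is more confident that the image matches the prompt.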

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the poor performance of current text-to-visual generation models when handling complex text prompts. Specifically, these models struggle with generating images and videos that involve attributes, relationships, and high-level reasoning such as logic and comparison. The paper tackles these issues through the following aspects:

1. **Evaluating Existing Models**:
   - Conduct extensive human studies on leading image and video generation models using the **GenAI-Bench** benchmark to assess their performance in various compositional text-to-visual generation tasks.
   - Compare automated evaluation metrics with collected human scores, finding that the **VQAScore** metric significantly outperforms previous metrics (e.g., **CLIPScore**).
2. **Improving Generation Quality**:
   - Enhance the quality of generated images through a simple ranking method based on **VQAScore**, without the need for fine-tuning. Specifically, by selecting the image with the highest **VQAScore** from several candidates, human scores can be significantly improved.
3. **Releasing New Benchmarks**:
   - Release the **GenAI-Rank** benchmark, which includes over 40,000 human scores, to evaluate the ability of different scoring metrics to rank images generated from the same prompt.
   - Publish all human scores (over 80,000) to facilitate scientific benchmarking of generation models and automated metrics.

### Main Contributions

1. **Extensive Evaluation Study**:
   - Conduct a comprehensive human study on compositional text-to-visual generation using **GenAI-Bench**, revealing the limitations of existing open-source and closed-source models.
2. **Improvement Method for Generation**:
   - Propose a simple yet effective black-box method to improve generation quality by ranking candidate images using **VQAScore**, significantly outperforming other scoring methods.
3. **New Benchmark Dataset**:
   - Release **GenAI-Rank**, which includes a large number of human scores for evaluating automated metrics for image ranking.

### Future Work

Although **VQAScore** performs well in many aspects, it still has some limitations, such as poor performance in handling fine-grained visual details and language ambiguities. Future work could explore higher-resolution VQA models and more powerful language models to address these issues. Additionally, **GenAI-Bench** has not yet evaluated important aspects such as the toxicity, bias, aesthetics, and video motion of generation models, which can be further expanded in future evaluations.
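
The black-box ranking idea summarized above (generate a few candidates, score each against the prompt, keep the highest-scoring one) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rank_candidates`, `best_of_n`, and `toy_score` are hypothetical names, and `toy_score` is a word-overlap stand-in that treats caption strings as "images" so the example runs without any model. In practice `score_fn` would wrap a real alignment metric such as VQAScore.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def rank_candidates(
    prompt: str,
    candidates: Sequence[Candidate],
    score_fn: Callable[[Candidate, str], float],
) -> List[Tuple[Candidate, float]]:
    """Return (candidate, score) pairs sorted from best to worst.

    Ties keep the original generation order, so results are deterministic.
    """
    scored = [(score_fn(c, prompt), idx) for idx, c in enumerate(candidates)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(candidates[idx], score) for score, idx in scored]


def best_of_n(
    prompt: str,
    candidates: Sequence[Candidate],
    score_fn: Callable[[Candidate, str], float],
) -> Candidate:
    """Pick the single highest-scoring candidate (black-box, no fine-tuning)."""
    if not candidates:
        raise ValueError("need at least one candidate to rank")
    return rank_candidates(prompt, candidates, score_fn)[0][0]


def toy_score(caption: str, prompt: str) -> float:
    """Hypothetical stand-in scorer: fraction of prompt words found in a caption.

    A real scorer would take an image and return a model-based alignment score.
    """
    prompt_words = set(prompt.lower().split())
    caption_words = set(caption.lower().split())
    return len(prompt_words & caption_words) / len(prompt_words)


if __name__ == "__main__":
    prompt = "a red cube on a blue sphere"
    # Captions stand in for generated images in this toy example.
    candidates = ["a blue cube", "a red cube on a blue sphere", "a red sphere"]
    for candidate, score in rank_candidates(prompt, candidates, toy_score):
        print(f"{score:.2f}  {candidate}")
    print("selected:", best_of_n(prompt, candidates, toy_score))
```

Because the selection step only needs a score per candidate, the generator itself is never modified, which is what makes the approach black-box. The abstract reports using 3 to 9 candidates per prompt.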

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
