Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, Hongsheng Li

2023-09-25

Abstract: Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy to use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from academia, the community and industry. The code and dataset are available at https://github.com/tgxs002/HPSv2.

Computer Vision and Pattern Recognition, Artificial Intelligence, Databases

What problem does this paper attempt to address?

This paper addresses the problem that existing evaluation metrics for text-to-image generation models (such as Inception Score, Fréchet Inception Distance, and CLIP Score) do not accurately reflect human preferences for generated images. Although these models can generate high-fidelity images, the scores that existing metrics assign to them differ significantly from actual human preferences. To close this gap, the authors constructed a large-scale dataset, Human Preference Dataset v2 (HPD v2), which contains 798,090 human preference choices on 433,760 pairs of images. By fine-tuning the CLIP model on HPD v2, they obtained a new preference prediction model, Human Preference Score v2 (HPS v2). HPS v2 is intended to predict human preferences for generated images more accurately; according to the authors, it generalizes better than previous metrics across various image distributions and responds to algorithmic improvements in text-to-image generation models. The authors also studied the design of evaluation prompts so that evaluation is stable, fair, and easy to use. Finally, they established a benchmark based on HPS v2 that covers recent text-to-image models from academia, the community, and industry.
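
To make the idea of learning a score from pairwise choices concrete, here is a minimal sketch of one way a CLIP-style preference score could be computed and trained. It is an illustration under stated assumptions, not the authors' implementation: the page above does not specify the training objective, so the temperature-scaled cosine similarity, the softmax over the image pair, and the cross-entropy loss are assumed here, and the function names (cosine, preference_probs, preference_loss) and the toy two-dimensional embeddings are hypothetical stand-ins for real CLIP text and image features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preference_probs(text_emb, image_embs, temperature=0.07):
    """Predicted probability that each candidate image is the preferred one.

    Assumption: the score of an image is its temperature-scaled cosine
    similarity to the prompt embedding, and candidates compete via softmax.
    """
    logits = [cosine(text_emb, img) / temperature for img in image_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def preference_loss(text_emb, image_embs, chosen_index, temperature=0.07):
    """Cross-entropy between the predicted preference and the human choice."""
    probs = preference_probs(text_emb, image_embs, temperature)
    return -math.log(probs[chosen_index])


# Toy embeddings (hypothetical): image A is close to the prompt, image B is not.
text = [1.0, 0.0]
img_a = [0.9, 0.1]
img_b = [0.1, 0.9]

probs = preference_probs(text, [img_a, img_b])
loss_if_a_chosen = preference_loss(text, [img_a, img_b], chosen_index=0)
loss_if_b_chosen = preference_loss(text, [img_a, img_b], chosen_index=1)
```

In this sketch the loss is small when the annotator chose the image the model already scores higher and large otherwise, so minimizing it over many annotated pairs would push the encoder's scores toward the human choices. A real setup would replace the toy vectors with embeddings from a CLIP model and update its weights by gradient descent.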

Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

Learning Multi-dimensional Human Preference for Text-to-Image Generation

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

Stable Preference: Redefining Training Paradigm of Human Preference Model for Text-to-Image Synthesis

Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis

Scalable Ranked Preference Optimization for Text-to-Image Generation

HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models

Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback

FaceScore: Benchmarking and Enhancing Face Quality in Human Generation

Text2Human

HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

PrefIQA: Human Preference Learning for AI-generated Image Quality Assessment

T2I-Scorer: Quantitative Evaluation on Text-to-Image Generation Via Fine-Tuned Large Multi-Modal Models

Learning and Evaluating Human Preferences for Conversational Head Generation

Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation