Abstract: Generating visually appealing images is fundamental to modern text-to-image generation models. A potential solution to better aesthetics is direct preference optimization (DPO), which has been applied to diffusion models to improve general image quality, including prompt alignment and aesthetics. Popular DPO methods propagate preference labels from clean image pairs to all the intermediate steps along the two generation trajectories. However, the preference labels in existing datasets blend layout and aesthetic opinions, and so may disagree with purely aesthetic preference. Even if aesthetic labels were provided (at substantial cost), it would be hard for the two-trajectory methods to capture nuanced visual differences at different steps. To improve aesthetics economically, this paper uses existing generic preference data and introduces step-by-step preference optimization (SPO), which discards the propagation strategy and allows fine-grained image details to be assessed. Specifically, at each denoising step, we 1) sample a pool of candidates by denoising from a shared noise latent, 2) use a step-aware preference model to find a suitable win-lose pair to supervise the diffusion model, and 3) randomly select one candidate from the pool to initialize the next denoising step. This strategy ensures that the diffusion model focuses on subtle, fine-grained visual differences rather than on layout. We find that aesthetics can be significantly enhanced by accumulating these small improvements. When fine-tuning Stable Diffusion v1.5 and SDXL, SPO yields significant improvements in aesthetics compared with existing DPO methods while not sacrificing image-text alignment compared with vanilla models. Moreover, SPO converges much faster than DPO methods due to the step-by-step alignment of fine-grained visual details.
Code and models are available at https://github.com/RockeyCoss/SPO.
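
The three per-step operations described in the abstract can be illustrated with a toy sketch. This is not the released implementation: `toy_denoise`, `toy_step_aware_score`, and `spo_rollout` are hypothetical names, the latent is a plain list of floats, and taking the best- and worst-scored candidates as the win-lose pair is an assumption (the abstract only says a "suitable" pair is found). The sketch only collects the pairs that would supervise the diffusion model; the preference loss and parameter update are omitted.

```python
import random


def toy_denoise(x, t, rng):
    """Hypothetical stand-in for one stochastic denoising step."""
    return [0.9 * v + 0.1 * rng.gauss(0.0, 1.0) for v in x]


def toy_step_aware_score(x, t):
    """Hypothetical stand-in for a step-aware preference model.

    Higher is better; here it simply prefers latents close to zero.
    """
    return -sum(v * v for v in x)


def spo_rollout(x_T, num_steps, pool_size, rng):
    """Collect one win-lose pair per denoising step (toy illustration)."""
    x_t = x_T
    pairs = []
    for t in range(num_steps, 0, -1):
        # 1) Sample a pool of candidates, all denoised from the shared latent x_t.
        pool = [toy_denoise(x_t, t, rng) for _ in range(pool_size)]

        # 2) Score the candidates at this step and pick a win-lose pair.
        #    Assumption: best vs. worst candidate in the pool.
        ranked = sorted(pool, key=lambda c: toy_step_aware_score(c, t))
        lose, win = ranked[0], ranked[-1]
        pairs.append((t, x_t, win, lose))

        # 3) Randomly select one candidate to initialize the next step.
        x_t = rng.choice(pool)
    return x_t, pairs


if __name__ == "__main__":
    rng = random.Random(0)
    final, pairs = spo_rollout([1.0, -2.0, 0.5], num_steps=5, pool_size=4, rng=rng)
    print(len(pairs), "win-lose pairs collected")
```

Because every candidate in a pool starts from the same latent, the win and lose samples differ only in what a single denoising step changes, which is the fine-grained signal the abstract refers to. Choosing the next latent at random, rather than always taking the winner, is the step the abstract describes as item 3.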

Stable Preference: Redefining Training Paradigm of Human Preference Model for Text-to-Image Synthesis

Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback

Scalable Ranked Preference Optimization for Text-to-Image Generation

Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

Learning Multi-dimensional Human Preference for Text-to-Image Generation

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis

Fine Tuning Text-to-Image Diffusion Models for Correcting Anomalous Images

PrefIQA: Human Preference Learning for AI-generated Image Quality Assessment

Human Aesthetic Preference-Based Large Text-to-Image Model Personalization: Kandinsky Generation as an Example

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

Imaginique Expressions: Tailoring Personalized Short-Text-to-Image Generation Through Aesthetic Assessment and Human Insights