Curriculum Direct Preference Optimization for Diffusion and Consistency Models

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah

2024-05-24

Abstract: Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). In this paper, we propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation. Our method is divided into two training stages. First, a ranking of the examples generated for each prompt is obtained by employing a reward model. Then, increasingly difficult pairs of examples are sampled and provided to a text-to-image generative (diffusion or consistency) model. Generated samples that are far apart in the ranking are considered to form easy pairs, while those that are close in the ranking form hard pairs. In other words, we use the rank difference between samples as a measure of difficulty. The sampled pairs are split into batches according to their difficulty levels, which are gradually used to train the generative model. Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on three benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://anonymous.4open.science/r/Curriculum-DPO-EE14.
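
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `rank_samples` and `build_curriculum`, the parameter `num_levels`, and the equal-width split of rank gaps into difficulty levels are all assumptions made for illustration, and the reward scores are taken as given rather than produced by an actual reward model.

```python
from itertools import combinations


def rank_samples(rewards):
    """Return sample indices ordered from highest to lowest reward."""
    return sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)


def build_curriculum(rewards, num_levels):
    """Group (preferred, rejected) index pairs into difficulty levels.

    Level 0 holds the easiest pairs (largest rank gap) and the last level
    holds the hardest pairs (smallest rank gap). The equal-width mapping
    from rank gap to level is an assumption of this sketch.
    """
    order = rank_samples(rewards)
    n = len(order)
    max_gap = n - 1
    levels = [[] for _ in range(num_levels)]
    # `better` and `worse` are positions in the ranking, with better < worse.
    for better, worse in combinations(range(n), 2):
        gap = worse - better  # rank difference, between 1 and n - 1
        level = min((max_gap - gap) * num_levels // max_gap, num_levels - 1)
        levels[level].append((order[better], order[worse]))
    return levels


# Hypothetical reward scores for four images generated from one prompt.
rewards = [0.2, 0.9, 0.5, 0.7]
curriculum = build_curriculum(rewards, num_levels=3)

# Training would consume the levels in order, easiest first.
for level, pairs in enumerate(curriculum):
    print(f"level {level}: {pairs}")
```

Each pair lists the preferred sample first, so a preference-optimization step could use it directly as a (winner, loser) example. Whether earlier levels are revisited once harder ones are introduced is not stated in the abstract, so the sketch leaves that choice open.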

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses how to fine-tune text-to-image generative models so that their outputs improve in text alignment, visual aesthetics, and human preference. Existing Direct Preference Optimization (DPO) methods are effective, but they sample pairs of generated images at random during training, which limits training efficiency and leaves room for improvement in generation quality. The authors therefore propose Curriculum DPO, an enhanced DPO method based on curriculum learning that presents sample pairs to the model in order of increasing difficulty.

Specifically, the main contributions of the paper include:

1. **Introducing Curriculum DPO**: A new training strategy that gradually optimizes diffusion models and consistency models through curriculum learning to improve the quality of generated samples.
2. **Adapting Consistency Models**: Proposing a DPO method suitable for consistency models (Consistency-DPO), achieving short training and inference times.
3. **Experimental Validation**: Demonstrating that Curriculum DPO outperforms existing state-of-the-art methods on three evaluation benchmarks in terms of text alignment, visual aesthetics, and human preference.

Through these improvements, Curriculum DPO is able to generate high-quality images while better meeting human preferences.

Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization

Diffusion Model Alignment Using Direct Preference Optimization

SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Scalable Ranked Preference Optimization for Text-to-Image Generation

Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences

SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

Direct Preference Optimization With Unobserved Preference Heterogeneity

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

New Desiderata for Direct Preference Optimization

Filtered Direct Preference Optimization

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Direct Preference Optimization with an Offset

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Aligning Diffusion Models with Noise-Conditioned Perception

On Discrete Prompt Optimization for Diffusion Models

Optimizing Preference Alignment with Differentiable NDCG Ranking