Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Weijian Luo
2024-10-25
Abstract: One-step text-to-image generator models offer advantages such as swift inference efficiency, flexible architectures, and state-of-the-art generation performance. In this paper, we study the problem of aligning one-step generator models with human preferences for the first time. Inspired by the success of reinforcement learning using human feedback (RLHF), we formulate the alignment problem as maximizing expected human reward functions while adding an Integral Kullback-Leibler divergence term to prevent the generator from diverging. By overcoming technical challenges, we introduce Diff-Instruct++ (DI++), the first fast-converging and image data-free human preference alignment method for one-step text-to-image generators. We also introduce novel theoretical insights, showing that using CFG for diffusion distillation is secretly doing RLHF with DI++. This finding brings understanding and potential contributions to future research involving CFG. In the experiment sections, we align both UNet-based and DiT-based one-step generators using DI++, which use Stable Diffusion 1.5 and PixelArt-$\alpha$ as the reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-sourced models such as Stable Diffusion XL, DMD2, SD-Turbo, as well as PixelArt-$\alpha$. Both theoretical contributions and empirical evidence indicate that DI++ is a strong human-preference alignment approach for one-step text-to-image models.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the problem of aligning **one-step generator models** with human preferences. Existing one-step text-to-image generation models have the following limitations:

1. **Inadequate prompt-following ability**: The generated images may not accurately reflect the text prompts supplied by the user.
2. **Sub-optimal aesthetic quality**: The generated images may not be visually appealing or may not meet human aesthetic standards.
3. **Generation of harmful content**: In some cases, the model may generate inappropriate or toxic content.

The root cause of these problems is that the generators are not well aligned with human preferences. The author therefore proposes a new method, **Diff-Instruct++ (DI++)**, for training one-step generators so that they better conform to human preferences.

### Main contributions of DI++

1. **First study of aligning one-step generators with human preferences**: The alignment problem is formulated as maximizing the expected human reward while adding an Integral Kullback-Leibler (IKL) divergence term that keeps the generator from deviating from the reference diffusion model.
2. **An effective loss function and training method**: DI++ is a fast-converging method that requires no image data and is suited to one-step text-to-image generators.
3. **New theoretical insights**: The paper shows that diffusion distillation with classifier-free guidance (CFG) is secretly performing reinforcement learning from human feedback (RLHF) with DI++, with the guidance playing the role of an implicit reward function. This offers new tools and understanding for future research involving CFG.
4. **Experimental verification**: The effectiveness of DI++ is demonstrated on several evaluation metrics (HPSv2.0, Aesthetic Score, and Image Reward), with improvements on both UNet-based and DiT-based one-step generators.
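
To make the CFG point in contribution 3 more concrete, here is a minimal sketch of how classifier-free guidance is commonly written: the guided score is the unconditional score plus a scaled difference between the conditional and unconditional scores. The function names and the list-of-floats representation are illustrative assumptions, not taken from the paper, and the paper's exact implicit reward is not reproduced here. The sketch only shows that the guided score equals the conditional score plus an extra guidance term, which is the part that can be read as a reward-like signal.

```python
def cfg_score(score_cond, score_uncond, guidance_scale):
    """Common CFG parametrization (an assumption here): s_u + w * (s_c - s_u)."""
    return [u + guidance_scale * (c - u) for c, u in zip(score_cond, score_uncond)]


def guidance_term(score_cond, score_uncond, guidance_scale):
    """Extra term added on top of the conditional score: (w - 1) * (s_c - s_u)."""
    return [(guidance_scale - 1.0) * (c - u) for c, u in zip(score_cond, score_uncond)]


def cond_plus_guidance(score_cond, score_uncond, guidance_scale):
    """Equivalent form of cfg_score: conditional score plus the guidance term."""
    extra = guidance_term(score_cond, score_uncond, guidance_scale)
    return [c + e for c, e in zip(score_cond, extra)]
```

With `guidance_scale = 1` the guidance term vanishes and the guided score is just the conditional score. Larger scales add a push along the difference between the conditional and unconditional scores.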
### Formula presentation

- **Objective function**:
  \[ L(\theta)=E_{c \sim C, x \sim p_{\theta}(x|c)}[-r(x, c)]+\beta D_{KL}(p_{\theta}(x|c), p_{\text{ref}}(x|c)) \]
  where \(r(x, c)\) is the reward model, \(p_{\text{ref}}(x|c)\) is the reference distribution, and \(\beta\) controls the strength of the regularization.
- **Gradient formula**:
  \[ \text{Grad}(\theta)=E_{c \sim C, z \sim p_{z}, x_{0}=g_{\theta}(z|c)}\left[\left(-\nabla_{x} r(x_{0}, c)+\beta\left(\nabla_{x} \log p_{\theta}(x_{0}|c)-\nabla_{x} \log p_{\text{ref}}(x_{0}|c)\right)\right) \frac{\partial x_{0}}{\partial \theta}\right] \]
  where \(g_{\theta}\) is the one-step generator and the factor \(\frac{\partial x_{0}}{\partial \theta}\) applies to both the reward term and the score-difference term.
- **Objective function with IKL regularization**:
  \[ L(\theta)=E_{c, z \sim p_{z}, x_{0}=g_{\theta}(z|c), x_{t}|x_{0} \sim p(x_{t}|x_{0})}\left[-r(x_{0}, c)\right]+\beta \int_{0}^{T} w(t) D_{KL}(p_{\theta}(x_{t}|t, c), p_{\text{ref}}(x_{t}|t, c)) d t \]
  where \(w(t)\) is a weighting function over diffusion times \(t \in [0, T]\), and \(p_{\theta}(x_{t}|t, c)\) and \(p_{\text{ref}}(x_{t}|t, c)\) are the noised marginals at time \(t\).

With these formulas, DI++ aligns one-step generators with human preferences while keeping them close to the reference diffusion model, with the aim of improving the quality and prompt consistency of the generated images.
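
The gradient formula can be checked on a toy problem where every term is available in closed form. The sketch below is not the paper's training procedure. It assumes a one-dimensional generator \(x_0 = \theta + z\) with \(z \sim N(0, 1)\), a reference distribution \(N(\mu_{\text{ref}}, 1)\), and a hypothetical quadratic reward \(r(x) = -(x - a)^2\), so both scores are analytic and no diffusion model or IKL integral is needed. All function and parameter names are illustrative. Under these assumptions the objective is minimized at \(\theta^* = (2a + \beta \mu_{\text{ref}}) / (2 + \beta)\), which gradient descent with the Monte Carlo gradient should approach.

```python
import random


def reward_grad(x, target):
    # Toy reward r(x) = -(x - target)^2, so dr/dx = -2 (x - target).
    return -2.0 * (x - target)


def gaussian_score(x, mean):
    # Score of a unit-variance Gaussian: d/dx log N(x; mean, 1) = -(x - mean).
    return -(x - mean)


def di_gradient(theta, target, ref_mean, beta, batch_size, rng):
    """Monte Carlo estimate of the gradient formula for the toy setup."""
    total = 0.0
    for _ in range(batch_size):
        z = rng.gauss(0.0, 1.0)
        x0 = theta + z            # toy generator g_theta(z) = theta + z
        dx0_dtheta = 1.0          # derivative of the toy generator w.r.t. theta
        score_gen = gaussian_score(x0, theta)     # grad_x log p_theta(x0)
        score_ref = gaussian_score(x0, ref_mean)  # grad_x log p_ref(x0)
        total += (-reward_grad(x0, target)
                  + beta * (score_gen - score_ref)) * dx0_dtheta
    return total / batch_size


def train(theta=0.0, target=3.0, ref_mean=0.0, beta=1.0,
          lr=0.05, steps=200, batch_size=1024, seed=0):
    """Plain gradient descent on theta using the estimated gradient."""
    rng = random.Random(seed)
    for _ in range(steps):
        theta -= lr * di_gradient(theta, target, ref_mean, beta, batch_size, rng)
    return theta


def closed_form_optimum(target, ref_mean, beta):
    # Minimizer of (theta - target)^2 + 1 + beta * (theta - ref_mean)^2 / 2.
    return (2.0 * target + beta * ref_mean) / (2.0 + beta)
```

With the defaults, `closed_form_optimum(3.0, 0.0, 1.0)` is `2.0`, and `train()` should settle close to that value. Raising `beta` pulls the solution toward `ref_mean`, and lowering it pulls the solution toward the reward maximum at `target`, which is the trade-off the regularization term is meant to control.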