Abstract: Aligning text-to-image diffusion model (T2I) with preference has been gaining increasing research attention. While prior works exist on directly optimizing T2I by preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, while ignoring the sequential nature of the generation process. From literature, this may harm the efficacy and efficiency of alignment. In this paper, we take on a finer dense reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into the DPO-style explicit-reward-free loss, to break the temporal symmetry therein and suit the T2I generation hierarchy. In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further studies are conducted to illustrate the insight of our approach.

What problem does this paper attempt to address?

This paper addresses how to align text-to-image (T2I) generation models with human preferences more effectively. Existing methods that optimize T2I models from preference data usually assume a single latent reward function over the entire diffusion reverse chain (i.e., the whole generation process). This leads to an overly large decision space and sparse rewards, which hurts the efficiency and effectiveness of training. The paper instead adopts a dense-reward perspective and introduces a temporal discount factor, which breaks the temporal symmetry of the loss and emphasizes the initial steps of the generation process.

### Main Contributions

1. **Dense-Reward Perspective**: Unlike prior methods that assume one latent reward function for the entire generation trajectory, this paper assumes a latent reward function for each generation step, which makes the learning problem easier.
2. **Temporal Discount Factor**: To emphasize the initial steps of the generation process, the paper introduces a temporal discount factor, which helps improve the efficiency and effectiveness of training.
3. **Theoretical Analysis**: The analysis shows that, under certain conditions, the temporal discount factor can effectively reduce the search space and improve model performance.
4. **Experimental Verification**: Experiments on single-prompt and multi-prompt generation tasks verify the effectiveness of the proposed method, especially in terms of generated image quality and preference alignment.

### Method Overview

1. **Problem Definition**:
   - Assume a latent dense-reward function \( r(s_t, a_t) \) that scores each generation step.
   - Model the diffusion reverse process as a Markov decision process (MDP), where the state \( s_t \) includes the current image \( x_t \), the time step \( t \), and the text condition \( c \).
2. **Objective Function**:
   - Define an objective function \( e(\tau) \) with a temporal discount factor to evaluate the quality of a generation trajectory \( \tau \).
   - Add a KL regularization term to avoid generating unnatural images.
3. **Optimization Method**:
   - Obtain the optimal policy \( \pi^* \) by solving the first-order conditions of the Lagrangian form.
   - Use an approximate off-policy objective and train the model by maximizing a lower bound.
4. **Experimental Setup**:
   - For single-prompt generation, use the DPOK prompts to verify the effectiveness of the method.
   - For multi-prompt generation, use the HPSv2 dataset to further verify how well the method generalizes.

### Experimental Results

- **Single-Prompt Generation**: Across multiple tasks (such as color, quantity, combination, and position), the proposed method outperforms the baseline methods in both ImageReward and Aesthetic scores, especially in the fidelity and aesthetic quality of the generated images.
- **Multi-Prompt Generation**: On the HPSv2 dataset, the proposed method achieves the best results in both HPSv2 and Aesthetic scores, verifying its effectiveness on complex tasks.

### Conclusion

By introducing the dense-reward perspective and the temporal discount factor, this paper provides a more effective method for aligning text-to-image generation models with human preferences. The experimental results show that the method is superior to existing methods in generated image quality and preference alignment.
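To make the temporal-discounting idea from the method overview more concrete, here is a minimal Python sketch of a DPO-style pairwise loss in which each reverse step's policy-to-reference log-ratio is weighted by `gamma ** t`. It is an illustration under stated assumptions, not the paper's exact objective: the function name, the `gamma` and `beta` parameters, the convention that `t = 0` is the first (noisiest) step of the reverse chain, and summing over all steps rather than sampling them are all hypothetical choices made for this example.

```python
import math
from typing import Sequence


def discounted_preference_loss(
    logratio_win: Sequence[float],
    logratio_lose: Sequence[float],
    gamma: float = 0.9,
    beta: float = 1.0,
) -> float:
    """DPO-style loss with temporally discounted per-step log-ratios.

    logratio_win[t] and logratio_lose[t] are assumed to hold
    log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t) at reverse step t
    for the preferred and the rejected trajectory, respectively.
    Index t = 0 is assumed to be the first step of the reverse chain.
    """
    if len(logratio_win) != len(logratio_lose):
        raise ValueError("both trajectories must have the same number of steps")

    # Discounted margin: early steps (small t) receive larger weights when gamma < 1.
    margin = beta * sum(
        (gamma ** t) * (w - l)
        for t, (w, l) in enumerate(zip(logratio_win, logratio_lose))
    )

    # -log(sigmoid(margin)) == softplus(-margin), computed in a numerically stable way.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Usage: the same per-step improvement counts for more at an early step.
early = discounted_preference_loss([1.0, 0.0], [0.0, 0.0], gamma=0.5)
late = discounted_preference_loss([0.0, 1.0], [0.0, 0.0], gamma=0.5)
print(early < late)
```

With `gamma = 1.0` the two cases above yield the same loss, which is the temporal symmetry the summary refers to. With `gamma < 1` an improvement at an early step lowers the loss more than the same improvement at a late step. When all log-ratios are zero the loss equals `log 2` (about 0.693), as in a standard DPO-style loss at initialization.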

A Dense Reward View on Aligning Text-to-Image Diffusion with Preference

Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization

Aligning Diffusion Models by Optimizing Human Utility

Improving Long-Text Alignment for Text-to-Image Diffusion Models

Aligning Text-to-Image Diffusion Models with Reward Backpropagation

Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Diffusion Model Alignment Using Direct Preference Optimization

ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning

Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models

Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases

Aligning Diffusion Models with Noise-Conditioned Perception

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

Regularized Conditional Diffusion Model for Multi-Task Preference Alignment

Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models

MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences

Text-image Alignment for Diffusion-based Perception

RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment