Abstract: Aligning text-to-image (T2I) diffusion models with preference has been gaining
increasing research attention. While prior works exist on directly optimizing
T2I by preference data, these methods are developed under the bandit assumption
of a latent reward on the entire diffusion reverse chain, while ignoring the
sequential nature of the generation process. According to the literature, this may harm the
efficacy and efficiency of alignment. In this paper, we take on a finer dense
reward perspective and derive a tractable alignment objective that emphasizes
the initial steps of the T2I reverse chain. In particular, we introduce
temporal discounting into the DPO-style explicit-reward-free loss, to break the
temporal symmetry therein and suit the T2I generation hierarchy. In experiments
on single and multiple prompt generation, our method is competitive with strong
relevant baselines, both quantitatively and qualitatively. Further studies are
conducted to illustrate the insight of our approach.
### What problem does this paper attempt to address?
The paper addresses how to align text-to-image (T2I) generation models with human preferences more effectively. Existing methods that optimize T2I models directly from preference data assume a single latent reward over the entire diffusion reverse chain (i.e., the whole generation process). This trajectory-level assumption leads to an overly large decision space and sparse rewards, which hurts the efficiency and effectiveness of training. The paper instead adopts a dense-reward perspective and introduces a temporal discount factor, which breaks the temporal symmetry of the loss and emphasizes the initial steps of the generation process.
### Main Contributions
1. **Introducing the Dense-Reward Perspective**: Unlike prior methods, which assume a latent reward function for the entire generation trajectory, this paper assumes a latent reward function for each step of generation, which makes the learning problem easier.
2. **Temporal Discount Factor**: To emphasize the initial steps of the generation process, this paper introduces a temporal discount factor, which helps to improve the efficiency and effectiveness of model training.
3. **Theoretical Analysis**: Through theoretical analysis, it is proved that under certain conditions, using the temporal discount factor can effectively reduce the search space and improve the performance of the model.
4. **Experimental Verification**: Experiments were carried out on single-prompt and multi-prompt generation tasks to verify the effectiveness of the proposed method, especially in terms of the quality of generated images and preference alignment.
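To make the effect of the temporal discount factor concrete, here is a minimal sketch of how a discount `gamma` shifts weight toward the early steps of the reverse chain. The function name `discount_weights`, the indexing convention (step 0 is the first, noisiest step), and the normalization are illustrative assumptions, not details taken from the paper.

```python
def discount_weights(num_steps, gamma):
    """Return normalized weights gamma**t for steps t = 0 .. num_steps - 1.

    Assumption: step 0 is the first step of the reverse chain, so a
    gamma below 1 puts the most weight on the initial steps.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    raw = [gamma ** t for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    # With gamma = 1 every step counts equally (the symmetric case);
    # with gamma < 1 the weights decay along the chain.
    print(discount_weights(5, 1.0))
    print(discount_weights(5, 0.5))
```

With `gamma = 1` the weighting is uniform across steps, which corresponds to the temporally symmetric case the paper aims to break; any `gamma < 1` makes the weights strictly decreasing.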
### Method Overview
1. **Problem Definition**:
- Assume that there is a latent dense-reward function \( r(s_t, a_t) \) that can score each step of generation.
- Model the diffusion reverse process as a Markov decision process (MDP), where the state \( s_t \) includes the current image \( x_t \), the time step \( t \), and the text condition \( c \).
2. **Objective Function**:
- Define an objective function \( e(\tau) \) with a temporal discount factor to evaluate the quality of the generation trajectory.
- Introduce a KL regularization term to avoid generating unnatural images.
3. **Optimization Method**:
- Obtain the optimal policy \( \pi^* \) by solving the first-order conditions in the Lagrangian form.
- Use the approximate off-policy objective function and train the model by maximizing the lower bound.
4. **Experimental Setup**:
- In the single-prompt generation task, use the prompts in the DPOK dataset to verify the effectiveness of the method.
- In the multi-prompt generation task, use the HPSv2 dataset to further verify the generalization ability of the method.
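The objective described in the overview can be illustrated with a minimal sketch of a DPO-style loss with temporal discounting. It assumes the loss takes a Bradley-Terry form: a logistic loss on the discounted sum of per-step log-ratios between the trained policy and a frozen reference policy, compared across a preferred and a dispreferred trajectory. The function name, the argument names, the `beta` scale, and the exact functional form are all assumptions for illustration; the paper's actual objective may differ.

```python
import math


def discounted_dpo_loss(logp_win, logp_ref_win, logp_lose, logp_ref_lose,
                        gamma=0.9, beta=1.0):
    """Hypothetical discounted DPO-style loss for one trajectory pair.

    Each argument is a list of per-step log-probabilities, ordered from
    the first step of the reverse chain (index 0) to the last.
    """
    lengths = {len(logp_win), len(logp_ref_win),
               len(logp_lose), len(logp_ref_lose)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same number of steps")

    margin = 0.0
    for t in range(len(logp_win)):
        # Per-step log-ratio of policy to reference on each trajectory.
        ratio_win = logp_win[t] - logp_ref_win[t]
        ratio_lose = logp_lose[t] - logp_ref_lose[t]
        # gamma**t down-weights later steps (assumed convention).
        margin += (gamma ** t) * (ratio_win - ratio_lose)

    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    zero = [0.0] * 4
    # The same policy improvement placed at the first vs. the last step.
    print(discounted_dpo_loss([1.0, 0.0, 0.0, 0.0], zero, zero, zero, gamma=0.5))
    print(discounted_dpo_loss([0.0, 0.0, 0.0, 1.0], zero, zero, zero, gamma=0.5))
```

In this sketch, the same improvement in log-ratio lowers the loss more when it occurs at an early step than at a late one, which is one way to read "emphasizing the initial steps". No explicit reward model appears; the per-step log-ratios play that role, in the spirit of the explicit-reward-free loss the abstract mentions.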
### Experimental Results
- **Single-Prompt Generation**: On multiple tasks (such as color, quantity, combination, and position), the proposed method outperforms the baseline methods in both ImageReward and Aesthetic scores, especially in terms of the fidelity and aesthetic quality of the generated images.
- **Multi-Prompt Generation**: On the HPSv2 dataset, the proposed method achieves the best results in both HPSv2 and Aesthetic scores, verifying the effectiveness of the method on complex tasks.
### Conclusion
This paper provides a more effective method for aligning text-to-image generation models with human preferences by introducing the dense-reward perspective and the temporal discount factor. The experimental results show that this method is superior to existing methods in terms of the quality of generated images and preference alignment.