Abstract: Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling at attribute binding and others at spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate them on three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and the reward models over multiple iterations. A theoretical proof demonstrates the effectiveness of the method, and extensive experiments show significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: https://github.com/YangLing0818/IterComp
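To make the idea of training a reward model on image-rank pairs more concrete, the following is a minimal sketch of a pairwise ranking loss. The abstract does not state the loss that IterComp uses, so the Bradley-Terry form here is an assumption, and the function name `pairwise_ranking_loss` and its input layout are hypothetical, chosen only for illustration.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores_best_to_worst):
    """Mean Bradley-Terry loss over all pairs in one ranking.

    `scores_best_to_worst` holds the reward-model scores for images
    generated from one prompt, ordered from the highest-ranked image
    to the lowest-ranked one. (Assumed layout, not from the paper.)
    """
    # combinations() keeps input order, so each pair is (better, worse).
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two scored images")

    total = 0.0
    for better, worse in pairs:
        margin = better - worse
        # -log(sigmoid(margin)) equals softplus(-margin); this form
        # avoids overflow for large positive or negative margins.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(pairs)


if __name__ == "__main__":
    # Scores that agree with the ranking give a lower loss than
    # scores that contradict it.
    print(pairwise_ranking_loss([3.0, 2.0, 1.0]))
    print(pairwise_ranking_loss([1.0, 2.0, 3.0]))
```

Under this assumed loss, the reward model is pushed to score a higher-ranked image above a lower-ranked one for the same prompt. In a closed-loop setup like the one the abstract describes, such a reward model would then guide fine-tuning of the base diffusion model, whose new samples could be ranked again in the next iteration.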

TreeReward: Improve Diffusion Model Via Tree-Structured Feedback Learning

UniFL: Improve Latent Diffusion Model via Unified Feedback Learning

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Feedback Efficient Online Fine-Tuning of Diffusion Models

Large-scale Reinforcement Learning for Diffusion Models

RFSR: Improving ISR Diffusion Models via Reward Feedback Learning

Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

Diffusion Model Alignment Using Direct Preference Optimization

AdaDiff: Adaptive Step Selection for Fast Diffusion

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Analyzing and Improving the Training Dynamics of Diffusion Models

Reward Incremental Learning in Text-to-Image Generation

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

Relational Diffusion Distillation for Efficient Image Generation