Abstract: Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling at attribute binding and others at spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate them on three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and the reward models over multiple iterations. A theoretical proof demonstrates the effectiveness of the method, and extensive experiments show significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: https://github.com/YangLing0818/IterComp

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Zero-shot Composed Text-Image Retrieval

RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

CamDiff: Camouflage Image Augmentation via Diffusion Model

ControlCom: Controllable Image Composition using Diffusion Model

SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion

CLIP-Based Composed Image Retrieval with Comprehensive Fusion and Data Augmentation

TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition

Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

OneDiff: A Generalist Model for Image Difference Captioning

Compositional Image Decomposition with Diffusion Models

Collaborative group: Composed image retrieval via consensus learning from noisy annotations

Cross-domain Compositing with Pretrained Diffusion Models

Language-only Efficient Training of Zero-shot Composed Image Retrieval

Pretrain like You Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration