A Diffusion-based Method for Multi-turn Compositional Image Generation

Chao Wang

2023-11-14

Abstract: Multi-turn compositional image generation (M-CIG) is a challenging task that aims to iteratively manipulate a reference image given a modification text. While most existing methods for M-CIG are based on generative adversarial networks (GANs), recent advances in image generation have demonstrated the superiority of diffusion models over GANs. In this paper, we propose a diffusion-based method for M-CIG named conditional denoising diffusion with image compositional matching (CDD-ICM). We leverage CLIP as the backbone of the image and text encoders, and incorporate a gated fusion mechanism, originally proposed for question answering, to compositionally fuse the reference image and the modification text at each turn of M-CIG. We introduce a conditioning scheme to generate the target image based on the fusion results. To prioritize the semantic quality of the generated target image, we learn an auxiliary image compositional match (ICM) objective along with the conditional denoising diffusion (CDD) objective in a multi-task learning framework. We additionally perform ICM guidance and classifier-free guidance to improve performance. Experimental results show that CDD-ICM achieves state-of-the-art results on two benchmark datasets for M-CIG, i.e., CoDraw and i-CLEVR.
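As a rough illustration of the classifier-free guidance mentioned in the abstract, the sketch below shows the standard way a conditional and an unconditional noise prediction are combined at sampling time. It is not taken from the paper: the function name, the `guidance_scale` parameter, and the use of plain Python lists in place of tensors are assumptions made for this example, and the paper's ICM guidance is not shown.

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Combine two noise predictions with classifier-free guidance.

    eps_cond       -- noise predicted with the condition (hypothetical input)
    eps_uncond     -- noise predicted with the condition dropped
    guidance_scale -- assumed convention: 0 gives the unconditional
                      prediction, 1 gives the conditional one, and values
                      above 1 push further toward the condition
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(eps_cond, eps_uncond)
    ]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.0, 1.0]
    print(classifier_free_guidance(cond, uncond, 3.0))
```

Under this convention a scale of 1 leaves the conditional prediction unchanged, so any gain from guidance comes from scales above 1. Other write-ups parameterize the scale differently.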

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses Multi-turn Compositional Image Generation (M-CIG), in which a reference image is modified iteratively according to a modification text at each turn. The authors propose a diffusion-based approach, Conditional Denoising Diffusion with Image Compositional Matching (CDD-ICM), which combines CLIP encoders, a gated fusion mechanism, and a multi-task learning framework to improve the semantic quality of the generated images.

The main issues are:

1. Lack of a proper conditioning scheme: existing conditioning methods do not handle the combination of image and text well.
2. Insufficient attention to semantic quality: beyond visual quality, the generated image must contain the required objects, and those objects must form the desired topology.

To address these issues, CDD-ICM:

- Applies diffusion models to the M-CIG task;
- Establishes a multi-task learning framework in which Image Compositional Matching (ICM) serves as an auxiliary objective that explicitly strengthens conditioning;
- Achieves state-of-the-art performance on two benchmark datasets, CoDraw and i-CLEVR.

In experiments, CDD-ICM outperforms existing Generative Adversarial Network (GAN)-based methods in terms of precision, recall, F1 score, and relational similarity.
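The gated fusion step mentioned above can be sketched in a few lines. The summary does not give the paper's exact formulation, so this is an assumption: it uses one common form of gated fusion, in which a sigmoid gate mixes a tanh candidate computed from the concatenated image and text features with the original image feature. All names (`gated_fusion`, `w_cand`, `w_gate`, and so on) are hypothetical, and plain Python lists stand in for CLIP embeddings.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _linear(weights, bias, vec):
    # weights: one row per output unit; bias: one value per output unit
    return [
        sum(w * v for w, v in zip(row, vec)) + b
        for row, b in zip(weights, bias)
    ]


def gated_fusion(image_feat, text_feat, w_cand, b_cand, w_gate, b_gate):
    """Fuse an image feature with a text feature (assumed formulation).

    joint     = [image_feat ; text_feat]          (concatenation)
    candidate = tanh(W_cand @ joint + b_cand)
    gate      = sigmoid(W_gate @ joint + b_gate)
    output    = gate * candidate + (1 - gate) * image_feat
    """
    joint = list(image_feat) + list(text_feat)
    candidate = [math.tanh(v) for v in _linear(w_cand, b_cand, joint)]
    gate = [_sigmoid(v) for v in _linear(w_gate, b_gate, joint)]
    return [
        g * c + (1.0 - g) * x
        for g, c, x in zip(gate, candidate, image_feat)
    ]


if __name__ == "__main__":
    dim = 2
    zeros_w = [[0.0] * (2 * dim) for _ in range(dim)]
    zeros_b = [0.0] * dim
    # With all-zero parameters the gate is 0.5 and the candidate is 0,
    # so the output is half of the image feature.
    print(gated_fusion([2.0, -4.0], [1.0, 1.0],
                       zeros_w, zeros_b, zeros_w, zeros_b))
```

In this form the gate decides, per dimension, how much of the reference image feature is kept and how much is overwritten by information derived from the modification text. A strongly negative gate bias leaves the image feature almost unchanged.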

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Generating Intermediate Representations for Compositional Text-To-Image Generation

CCDM: Continuous Conditional Diffusion Models for Image Generation

Progressive Compositionality In Text-to-Image Generative Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Conditional Text Image Generation with Diffusion Models

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Contextualized Diffusion Models for Text-Guided Image and Video Generation

Compositional Text-to-Image Generation with Dense Blob Representations

Controlled and Conditional Text to Image Generation with Diffusion Prior

Multi-Concept Customization of Text-to-Image Diffusion

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

ControlCom: Controllable Image Composition using Diffusion Model

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

ECNet: Effective Controllable Text-to-Image Diffusion Models

CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation