A Diffusion-based Method for Multi-turn Compositional Image Generation

Chao Wang
2023-11-14
Abstract: Multi-turn compositional image generation (M-CIG) is a challenging task that aims to iteratively manipulate a reference image given a modification text. While most existing methods for M-CIG are based on generative adversarial networks (GANs), recent advances in image generation have demonstrated the superiority of diffusion models over GANs. In this paper, we propose a diffusion-based method for M-CIG named conditional denoising diffusion with image compositional matching (CDD-ICM). We use CLIP as the backbone of the image and text encoders, and incorporate a gated fusion mechanism, originally proposed for question answering, to compositionally fuse the reference image and the modification text at each turn of M-CIG. We introduce a conditioning scheme to generate the target image based on the fusion results. To prioritize the semantic quality of the generated target image, we learn an auxiliary image compositional matching (ICM) objective along with the conditional denoising diffusion (CDD) objective in a multi-task learning framework. We also apply ICM guidance and classifier-free guidance to improve performance. Experimental results show that CDD-ICM achieves state-of-the-art results on two benchmark datasets for M-CIG, namely CoDraw and i-CLEVR.
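The classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch of the commonly used blending rule. This is not the paper's implementation: the function name, the `scale` parameter, and the convention that `scale = 1` recovers the purely conditional prediction are assumptions made for illustration, and ICM guidance is not shown.

```python
from typing import List


def classifier_free_guidance(
    eps_cond: List[float], eps_uncond: List[float], scale: float
) -> List[float]:
    """Blend conditional and unconditional noise predictions.

    Hypothetical helper (not from the paper). Assumed convention:
      scale = 0  -> unconditional prediction
      scale = 1  -> conditional prediction
      scale > 1  -> extrapolate past the conditional prediction
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # eps = eps_uncond + scale * (eps_cond - eps_uncond), applied elementwise
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

In a real sampler the two predictions would be tensors produced by the denoising network with and without the fused image-text condition; plain lists are used here only to keep the sketch dependency-free.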
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of Multi-turn Compositional Image Generation (M-CIG). Specifically, the authors propose a diffusion-based approach called Conditional Denoising Diffusion with Image Compositional Matching (CDD-ICM), which iteratively modifies a reference image to generate a target image. The method combines CLIP-based image and text encoders, a gated fusion mechanism, and a multi-task learning framework to improve the semantic quality of the generated images.

The main issues include:
1. Lack of proper conditioning schemes: Existing conditioning methods do not handle the combination of images and text well.
2. Insufficient focus on the semantic quality of generated images: Beyond visual quality, the generated images must contain the required objects, and these objects must form the desired topology.

To address these issues, the authors propose CDD-ICM, which has the following features:
- Applies diffusion models to the M-CIG task;
- Establishes a multi-task learning framework where Image Compositional Matching (ICM) serves as an auxiliary objective to explicitly enhance conditioning;
- Achieves state-of-the-art performance on two benchmark datasets, CoDraw and i-CLEVR.

In the reported experiments, CDD-ICM outperforms existing Generative Adversarial Network (GAN)-based methods in terms of precision, recall, F1 score, and relational similarity.
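To make the object-level metrics concrete, here is a minimal sketch of how precision, recall, and F1 could be computed for one generated image. It assumes each image has already been reduced to a set of object labels by some detector, which is not shown. The function name and the set-based matching are illustrative assumptions, not the paper's evaluation code, and relational similarity is not covered.

```python
from typing import Iterable, Tuple


def object_prf(
    generated: Iterable[str], target: Iterable[str]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over object label sets.

    Hypothetical helper: `generated` holds labels detected in the generated
    image, `target` holds labels present in the ground-truth image.
    """
    gen, tgt = set(generated), set(target)
    true_pos = len(gen & tgt)  # objects that appear in both images
    precision = true_pos / len(gen) if gen else 0.0
    recall = true_pos / len(tgt) if tgt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, if the generated image contains `sun`, `tree`, and `dog` while the target contains `sun`, `tree`, `cat`, and `slide`, two of three generated objects are correct (precision 2/3) and two of four required objects are present (recall 1/2).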