Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva

2024-01-17

Abstract: Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).
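The multistep unrolling idea in the abstract can be illustrated with a toy, pure-Python sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the "image" is a short list of scalar pixels, `toy_denoiser`, `toy_step`, and `seg_discriminator` are hypothetical stand-ins for the diffusion generator, the sampler update, and the segmentation-based discriminator, and no gradients are computed. The sketch only shows how an alignment loss can be collected over a window of unrolled steps instead of a single timestep.

```python
import math

def toy_denoiser(x_t, layout, strength=0.5):
    """Hypothetical generator: estimate the clean image from the noisy one.
    Class 1 pixels are pulled toward +1, class 0 pixels toward -1."""
    targets = [1.0 if c == 1 else -1.0 for c in layout]
    return [x + strength * (tgt - x) for x, tgt in zip(x_t, targets)]

def toy_step(x_t, x0_hat, keep=0.5):
    """Hypothetical deterministic sampler update: move part of the way
    from the current noisy image toward the clean estimate."""
    return [keep * x + (1.0 - keep) * x0 for x, x0 in zip(x_t, x0_hat)]

def seg_discriminator(image):
    """Hypothetical segmenter: per-pixel probability of class 1."""
    return [1.0 / (1.0 + math.exp(-4.0 * p)) for p in image]

def alignment_loss(image, layout):
    """Mean per-pixel cross-entropy between the segmenter output and the layout."""
    eps = 1e-12
    probs = seg_discriminator(image)
    losses = [-math.log((p if c == 1 else 1.0 - p) + eps)
              for p, c in zip(probs, layout)]
    return sum(losses) / len(losses)

def unrolled_alignment_loss(x_t, layout, num_steps=3):
    """Unroll `num_steps` denoising steps and score every intermediate
    clean-image estimate against the layout, then average over the window."""
    per_step = []
    x = x_t
    for _ in range(num_steps):
        x0_hat = toy_denoiser(x, layout)                   # denoised estimate at this step
        per_step.append(alignment_loss(x0_hat, layout))    # discriminator feedback
        x = toy_step(x, x0_hat)                            # feed the result into the next step
    return sum(per_step) / len(per_step), per_step

layout = [0, 0, 1, 1]            # toy two-class layout, one label per pixel
x_t = [0.8, -0.3, -0.9, 0.2]     # toy noisy image
mean_loss, per_step = unrolled_alignment_loss(x_t, layout, num_steps=3)
print(mean_loss, per_step)
```

In an actual training setup, the averaged alignment term would presumably be combined with the usual diffusion objective and backpropagated through the unrolled steps; the toy above omits both and only makes the time-window structure concrete.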

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily addresses two core issues in the layout-to-image (L2I) synthesis task:

1. **Layout Fidelity**: Current L2I models exhibit poor alignment between the generated images and the input layouts.
2. **Text Controllability**: Existing models offer limited ability to edit the generated images through text prompts.

Specifically, the paper proposes integrating adversarial supervision into the conventional training pipeline of L2I diffusion models (the resulting method is called ALDM) to improve the alignment between generated images and input layouts while preserving text controllability. Additionally, the paper introduces a multistep unrolling strategy to encourage consistent adherence to the layout throughout the sampling process.

### Main Contributions

1. **Adversarial Supervision**: Adds a segmentation-based discriminator to the training of L2I diffusion models, improving layout alignment without sacrificing text controllability.
2. **Multistep Unrolling Strategy**: Unrolls a few denoising steps during training, imitating inference, so that the discriminator assesses layout alignment over a time window rather than at a single timestep.
3. **Data Augmentation Effect**: Shows that images synthesized by ALDM serve as effective data augmentation for semantic segmentation, particularly for domain generalization (e.g., an improvement of approximately 12 mIoU points in the Cityscapes to ACDC generalization task).
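For readers unfamiliar with the metric behind the reported gain, the following is a minimal sketch of mean intersection-over-union (mIoU) on flat label lists. The function name `mean_iou` and the toy labels are illustrative assumptions. Segmentation benchmarks typically accumulate intersections and unions over an entire dataset before averaging across classes, so this per-sample version is a simplification.

```python
def mean_iou(pred, target, num_classes):
    """Average the per-class intersection-over-union, skipping classes
    that appear in neither the prediction nor the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with three classes and six pixels (illustrative values only).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)  # per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5
```

mIoU is usually reported in percent, so a gain of about 12 mIoU points corresponds to an increase of roughly 0.12 on the 0 to 1 scale used here.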
