Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva
2024-01-17
Abstract: Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses two core issues in the layout-to-image (L2I) synthesis task:

1. **Layout Fidelity**: Current L2I models show weak alignment between the generated image and the input layout.
2. **Text Controllability**: Existing models offer poor editability of the generated image via text prompts.

Specifically, the paper proposes integrating adversarial supervision into the conventional training pipeline of L2I diffusion models (the resulting model is referred to as ALDM). A segmentation-based discriminator gives the diffusion generator pixel-level feedback on how well the denoised image matches the input layout, which improves layout alignment while preserving text controllability. The paper also introduces a multistep unrolling strategy: a few denoising steps are unrolled recursively during training to imitate inference, so that the discriminator assesses layout alignment over a window of timesteps rather than at a single timestep.

### Main Contributions

1. **Adversarial Supervision**: Adds adversarial supervision from a segmentation-based discriminator to the conventional training of L2I diffusion models, improving layout alignment without sacrificing text controllability.
2. **Multistep Unrolling Strategy**: Proposes a multistep unrolling strategy that encourages consistent adherence to the input layout across sampling steps.
3. **Data Augmentation Effect**: Shows that samples synthesized by ALDM via text control serve as effective data augmentation for semantic segmentation, particularly for domain generalization (e.g., an improvement of approximately 12 mIoU points in the Cityscapes-to-ACDC generalization task).
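
The combination of segmentation-based feedback and multistep unrolling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_segmenter` and `toy_denoise_step` are hypothetical stand-ins for the discriminator and for one reverse step of the diffusion model, images are flat lists of pixel intensities, and the layout is a list of integer class labels. Only the structure of the loss is meant to carry over: unroll a few denoising steps, score each predicted clean image against the layout with a per-pixel cross-entropy, and average over the window.

```python
import math


def pixel_cross_entropy(logits, labels):
    """Mean per-pixel cross-entropy; logits[i] holds the class scores of pixel i."""
    total = 0.0
    for scores, label in zip(logits, labels):
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
    return total / len(labels)


def toy_segmenter(image, num_classes=3):
    # Hypothetical discriminator: class c scores highest when the pixel
    # intensity is close to c. A real model would be a segmentation network.
    return [[-(px - c) ** 2 for c in range(num_classes)] for px in image]


def toy_denoise_step(x_t, layout, strength=0.5):
    # Hypothetical reverse step: returns the next (less noisy) sample and the
    # predicted clean image. A real model would be a layout-conditioned U-Net.
    x0_pred = [px + strength * (lab - px) for px, lab in zip(x_t, layout)]
    x_prev = [0.5 * (a + b) for a, b in zip(x_t, x0_pred)]
    return x_prev, x0_pred


def unrolled_adversarial_loss(x_t, layout, num_steps,
                              denoise_step=toy_denoise_step,
                              segmenter=toy_segmenter):
    """Average the layout-alignment loss over a window of unrolled steps."""
    per_step = []
    x = x_t
    for _ in range(num_steps):
        # Feed the model its own output, imitating the inference process.
        x, x0_pred = denoise_step(x, layout)
        per_step.append(pixel_cross_entropy(segmenter(x0_pred), layout))
    return sum(per_step) / len(per_step), per_step


layout = [0, 1, 2, 2]            # class label per pixel
x_t = [1.5, -0.5, 0.3, 1.0]      # noisy sample at the starting timestep
mean_loss, per_step = unrolled_adversarial_loss(x_t, layout, num_steps=3)
print(mean_loss, per_step)
```

In an actual training loop, a term of this kind would presumably be added to the usual noise-prediction loss with some weighting, and gradients would flow back through the unrolled steps into the generator; the sketch omits both.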