Freestyle Layout-to-Image Synthesis

Han Xue, Zhiwu Huang, Qianru Sun, Li Song, Wenjun Zhang
2023-03-25
Abstract: Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far can it generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., image classification and object detection) trained on limited base classes are empowered with the ability of unseen class prediction. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout which very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during its pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged into the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to enforce each text token to act on the pixels in a specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, which has a high potential to spawn a bunch of interesting applications. Code is available at https://github.com/essunny310/FreestyleNet.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses a new task in layout-to-image synthesis (LIS) called Freestyle Layout-to-Image Synthesis (FLIS). Traditional LIS models are limited to the closed set of semantic classes in their training dataset, such as the 182 common objects in COCO-Stuff. The goal of FLIS is to generate unseen semantics (classes, attributes, and styles) on a given layout, thereby overcoming the in-distribution limitation of existing LIS models.

#### Specific Objectives:

1. **Generate unseen semantics**: Leverage large-scale pre-trained text-to-image diffusion models to synthesize classes, attributes, and styles that the LIS training data does not cover, so that the model produces more diverse images.
2. **Control the details of the output image**: Control specific attributes, styles, and objects in the image through text input, while keeping these elements consistent with the layout.
3. **Improve the generality and controllability of the model**: Enable the model to generate not only the classes of a specific dataset but also unseen classes, such as those that arise in open-set or long-tail semantic segmentation settings.

#### Method Overview:

- **Propose a new module, Rectified Cross-Attention (RCA)**: The module is plugged into each cross-attention layer of a pre-trained text-to-image diffusion model and rectifies the attention maps between image and text tokens, so that each text token acts only on the pixels of its specified region.
- **Utilize large-scale pre-trained models**: Build on a pre-trained model such as Stable Diffusion and use RCA to bind its general generative knowledge to a specific layout.
- **Experimental validation**: Conduct qualitative and quantitative experiments on the COCO-Stuff and ADE20K datasets and compare the results with existing LIS methods.

#### Main Contributions:

1. Propose a new LIS task, FLIS, which uses large-scale pre-trained text-to-image diffusion models to achieve freestyle layout-to-image synthesis.
2. Introduce a parameter-free RCA module that integrates input layouts into pre-trained models to generate high-quality images.
3. Show experimentally that the method generates high-fidelity images with a wide range of novel semantics, beyond the closed class set of existing LIS models.
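
The core idea of RCA, restricting each text token to the pixels of its assigned region, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `rectified_cross_attention`, the boolean `region_mask`, the `token_to_class` mapping, and the choice to mask disallowed scores with negative infinity before the softmax are all assumptions made for illustration.

```python
import numpy as np

def rectified_cross_attention(q, k, v, region_mask):
    """Illustrative masked cross-attention (hypothetical helper, not the paper's code).

    q:           (P, d) queries, one per image pixel/patch
    k, v:        (T, d) keys and values, one per text token
    region_mask: (P, T) bool, True where token t is allowed to act on pixel p
    """
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)  # (P, T) raw attention scores
    if not region_mask.any(axis=-1).all():
        raise ValueError("every pixel needs at least one allowed text token")
    # Assumption: "rectify" by discarding scores outside the token's region.
    scores = np.where(region_mask, scores, -np.inf)
    # Numerically stable softmax over the text-token axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy 2x2 layout: top row is class 0, bottom row is class 1.
labels = np.array([[0, 0], [1, 1]])
# Hypothetical prompt with three tokens: two describe class 0, one describes class 1.
token_to_class = np.array([0, 0, 1])
mask = labels.reshape(-1, 1) == token_to_class.reshape(1, -1)  # (4 pixels, 3 tokens)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(3, 8))
v = rng.normal(size=(3, 8))
out, weights = rectified_cross_attention(q, k, v, mask)
```

In this toy setup the bottom-row pixels can attend only to the single class-1 token, so their outputs equal that token's value vector, while the top-row pixels mix the two class-0 tokens. In a real diffusion model the same masking would have to be applied at the spatial resolution of each cross-attention layer.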