Abstract: Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., for image classification and object detection) trained on limited base classes have gained the ability to predict unseen classes. Inspired by this, we leverage large-scale pre-trained text-to-image diffusion models to generate unseen semantics. The key challenge of FLIS is enabling the diffusion model to synthesize images from a specific layout that very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged into the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to constrain each text token to act on the pixels in a specified region, allowing us to freely place a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, which has high potential to enable many interesting applications. Code is available at https://github.com/essunny310/FreestyleNet.
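The key idea of RCA stated in the abstract, that each text token acts only on the pixels of its specified region, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rectified_cross_attention`, the list-based tensors, and the choice to rectify by dropping disallowed scores before the softmax are all assumptions made for illustration, and the toy layout and vectors are invented.

```python
import math

def rectified_cross_attention(queries, keys, values, region_mask):
    """Illustrative sketch (hypothetical helper, not the released code).

    queries:     N x d list, one row per image token (pixel).
    keys:        T x d list, one row per text token.
    values:      T x dv list, one row per text token.
    region_mask: N x T booleans; region_mask[n][t] is True when text
                 token t is allowed to act on pixel n.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, attention = [], []
    for q, allowed in zip(queries, region_mask):
        # Scaled dot-product score between this pixel and every text token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        # Rectification (assumed form): ignore tokens whose region excludes this pixel.
        kept = [s for s, ok in zip(scores, allowed) if ok]
        if not kept:
            # Pixel covered by no region: no text token contributes.
            weights = [0.0] * len(keys)
        else:
            top = max(kept)  # subtract the max for numerical stability
            exps = [math.exp(s - top) if ok else 0.0
                    for s, ok in zip(scores, allowed)]
            total = sum(exps)
            weights = [e / total for e in exps]
        attention.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, attention

# Toy example: four pixels; the layout assigns the first two to token 0
# (say, "unicorn") and the last two to token 1 (say, "bench").
layout = [0, 0, 1, 1]
region_mask = [[label == t for t in range(2)] for label in layout]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
outputs, attention = rectified_cross_attention(queries, keys, values, region_mask)
```

In this toy case each pixel is allowed exactly one token, so every attention row becomes one-hot regardless of the raw scores. That is the sense in which the layout, rather than the pre-trained attention pattern, decides where each piece of text takes effect.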

Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis

Image Synthesis From Reconfigurable Layout and Style

Image Synthesis from Layout with Locality-Aware Mask Adaption

Style Fader Generative Adversarial Networks for Style Degree Controllable Artistic Style Transfer

Interactive Image Synthesis with Panoptic Layout Generation

Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning

Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs

LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators

Layout2image: Image Generation from Layout

Attribute-Conditioned Layout GAN for Automatic Graphic Design

Style Transformer for Image Inversion and Editing

Object-driven Text-to-Image Synthesis via Adversarial Training

Learning What and Where to Draw

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

Style Separation and Synthesis Via Generative Adversarial Networks

DensityLayout: Density-Conditioned Layout GAN for Visual-Textual Presentation Designs

Towards Spatially Disentangled Manipulation of Face Images With Pre-Trained StyleGANs

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

Image-aware Layout Generation with User Constraints for Poster Design

Learning Semantic-aware Normalization for Generative Adversarial Networks

Freestyle Layout-to-Image Synthesis