Abstract: Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in precise control of image compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, we introduce a Localized Attention Constraint (LAC), leveraging semantic affinity between pixels in self-attention maps to create precise representations of desired objects and effectively ensure the accurate placement of objects in designated regions. We further propose a Padding Token Constraint (PTC) to leverage the semantic information embedded in previously neglected padding tokens, improving the consistency between object appearance and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods. Extensive experiments showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses a limitation of text-to-image (T2I) diffusion models: they generate high-quality images, but text prompts alone give imprecise control over spatial layout. Methods that position objects solely through text prompts are neither precise nor efficient for complex scenes. For example, in movie poster design, the spatial relationships between multiple objects and their attributes are difficult to control accurately through simple text prompts.

To solve this problem, researchers have explored layout-to-image synthesis (LIS) methods, which let users specify object positions through layout instructions such as bounding boxes, semantic masks, or doodles. However, existing fully supervised LIS methods require a large amount of paired layout-image training data, which is expensive and difficult to obtain in practice. Training and fine-tuning these models also consumes significant computational resources.

To address these issues, the paper proposes LoCo, a training-free layout-to-image synthesis method that generates high-quality images conforming to both text prompts and layout instructions without additional training. LoCo introduces two constraints: the Localized Attention Constraint (LAC) and the Padding Token Constraint (PTC). LAC uses the semantic affinity between pixels in self-attention maps to place generated objects accurately in their designated regions, while PTC uses the semantic information in previously ignored padding tokens to improve the consistency between object appearance and layout instructions.

Experimental results show that LoCo outperforms existing training-free layout-to-image synthesis methods on multiple benchmarks, in both quantitative and qualitative evaluations. Moreover, it can be integrated as a plugin into fully supervised LIS methods to further enhance their performance.
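
As a rough illustration of the idea behind a localized attention constraint, the following minimal sketch computes a loss that penalizes attention falling outside a target bounding box, after spreading a token's cross-attention map through self-attention affinities. It is an assumption-laden sketch, not the paper's formulation: the function names (`box_mask`, `refine_with_self_attention`, `localized_attention_loss`), the single matrix-product refinement step, and the squared-ratio loss are all hypothetical choices made for this example. In training-free guidance methods, a loss of this kind would typically be differentiated with respect to the noisy latent at each denoising step; the NumPy code below only evaluates the loss value.

```python
import numpy as np


def box_mask(h, w, box):
    """Binary (h, w) mask for a box (x0, y0, x1, y1) in normalized [0, 1] coordinates."""
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=np.float64)
    rows = slice(int(round(y0 * h)), int(round(y1 * h)))
    cols = slice(int(round(x0 * w)), int(round(x1 * w)))
    mask[rows, cols] = 1.0
    return mask


def refine_with_self_attention(cross_attn, self_attn):
    """Spread one token's cross-attention map along pixel-to-pixel affinities.

    cross_attn: (h, w) attention of one object token over image positions.
    self_attn:  (h*w, h*w) self-attention matrix (assumed row-normalized).
    Hypothetical refinement step: a single matrix-vector product.
    """
    h, w = cross_attn.shape
    refined = self_attn @ cross_attn.reshape(-1)
    return refined.reshape(h, w)


def localized_attention_loss(cross_attn, self_attn, mask, eps=1e-8):
    """Loss near 0 when refined attention lies inside the mask, near 1 when outside."""
    refined = refine_with_self_attention(cross_attn, self_attn)
    inside = (refined * mask).sum()
    total = refined.sum() + eps
    return float((1.0 - inside / total) ** 2)


if __name__ == "__main__":
    h = w = 8
    rng = np.random.default_rng(0)
    # Random stand-ins for attention maps taken from a diffusion model's layers.
    cross = rng.random((h, w))
    self_attn = rng.random((h * w, h * w))
    self_attn /= self_attn.sum(axis=1, keepdims=True)  # row-normalize
    mask = box_mask(h, w, (0.0, 0.0, 0.5, 0.5))  # top-left quadrant
    print(localized_attention_loss(cross, self_attn, mask))
```

The ratio form keeps the loss independent of the overall attention scale, so only where the attention mass sits matters, not how large it is.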

LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

Image Synthesis from Layout with Locality-Aware Mask Adaption

Training-free Composite Scene Generation for Layout-to-Image Synthesis

LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions

Layout-Bridging Text-to-Image Synthesis

Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

Freestyle Layout-to-Image Synthesis

PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control

Spatial-Aware Latent Initialization for Controllable Image Generation

Layout2image: Image Generation from Layout

SpotActor: Training-Free Layout-Controlled Consistent Image Generation

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

Enhancing Object Coherence in Layout-to-Image Synthesis

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Training-Free Layout Control with Cross-Attention Guidance

Object-driven Text-to-Image Synthesis via Adversarial Training

Locally controllable network based on visual–linguistic relation alignment for text-to-image generation

Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis