Abstract: Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting. To address these challenges, several studies developed layout-to-image (L2I) approaches that incorporate layout instructions into text-to-image models. However, existing L2I methods typically require either fine-tuning pretrained parameters or training additional control modules for the diffusion models. In this work, we propose a novel zero-shot L2I approach, BACON (Boundary Attention Constrained generation), which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions, and then compute loss functions to optimize latent features during the diffusion reverse process. To enhance spatial controllability and mitigate semantic failures in complex layout instructions, we leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features. Comprehensive experimental results on both L2I and non-L2I pretrained diffusion models demonstrate that our method outperforms existing zero-shot L2I techniques both quantitatively and qualitatively in terms of image composition on the DrawBench and HRS benchmarks.

What problem does this paper attempt to address?

The paper targets two related limitations. Text-to-image diffusion models struggle to control spatial composition and object count precisely, and the existing layout-to-image (L2I) methods that address this usually require fine-tuning pretrained parameters or training additional control modules for the diffusion model, which limits their use in data-scarce or resource-constrained scenarios. In response, the authors propose BACON (Boundary Attention Constrained generation), a zero-shot L2I method that needs neither additional modules nor fine-tuning. It works as follows:

1. **Using text-visual cross-attention feature maps**: Quantify the inconsistency between the layout of the generated image and the given layout instructions.
2. **Calculating loss functions**: Optimize the latent features during the diffusion reverse process.
3. **Utilizing pixel-to-pixel correlations**: Use the correlations in the self-attention feature maps to align the cross-attention maps, then update the latent features by combining three loss functions constrained by boundary attention. This enhances spatial controllability and reduces semantic failures under complex layout instructions.

### Specific Problems and Solutions

#### 1. Precise Control of Spatial Composition and Object Count

Existing L2I methods often generate the wrong number of objects when given complex layout instructions. For example, when the bounding boxes of multiple objects are closely arranged, their cross-attention maps may overlap, resulting in an incorrect object count in the generated image. BACON therefore introduces **boundary attention constraints**, which keep the cross-attention map of each object within its specified bounding box and promote the separation of multiple objects that share the same concept.

#### 2. Precise Alignment of Object Size and Position

Objects generated by existing methods are often larger than their specified bounding boxes or misaligned in position. BACON improves the alignment of object size and position through **self-attention enhancement**, which filters noisy cross-attention maps and raises the low attention scores at object edges.

#### 3. Zero-shot Learning

BACON can guide non-L2I diffusion models (such as Stable Diffusion) to generate images according to layout instructions without additional supervised training, and it can also enhance the spatial control of L2I models (such as GLIGEN).

### Experimental Results

The experimental results show that BACON outperforms existing zero-shot L2I techniques both quantitatively and qualitatively on the DrawBench and HRS benchmarks. Specifically, it shows a significant improvement in image composition (spatial relationships, size, color, and object count).

### Main Contributions

1. **Research on semantic failures under complex layout inputs**: In particular, inaccurate object counts caused by overlapping cross-attention maps.
2. **A novel method, BACON**: It filters noisy cross-attention maps through self-attention enhancement and introduces boundary attention constraints to prevent cross-attention maps from overlapping.
3. **Comprehensive experimental verification**: Compared with existing L2I methods, BACON achieves state-of-the-art performance in image composition.

Through these improvements, BACON significantly improves the accuracy and quality of zero-shot L2I generation, especially under complex layout instructions.
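
To make the attention-based ideas above more concrete, here is a minimal sketch of two generic building blocks used by attention-guided layout methods: a loss that measures how much of a token's cross-attention falls outside its bounding box, and a refinement step that propagates cross-attention scores along self-attention affinities. This is not BACON's actual formulation, which this summary does not specify. The function names, the `(row0, col0, row1, col1)` box convention, and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]          # attention map, indexed as grid[row][col]
Box = Tuple[int, int, int, int]   # hypothetical convention: (row0, col0, row1, col1), end-exclusive


def inside_box_loss(attn: Grid, box: Box) -> float:
    """Return 1 - (attention mass inside the box / total attention mass).

    The loss is 0 when all attention lies inside the box and approaches 1
    when almost none does. An all-zero map is treated as fully inconsistent.
    """
    r0, c0, r1, c1 = box
    total = sum(sum(row) for row in attn)
    if total == 0.0:
        return 1.0
    inside = sum(sum(row[c0:c1]) for row in attn[r0:r1])
    return 1.0 - inside / total


def refine_with_self_attention(cross: Grid, self_attn: Grid) -> Grid:
    """Propagate cross-attention scores along pixel-to-pixel affinities.

    `self_attn[i][j]` is the affinity between flattened pixels i and j
    (row-major order), with each row assumed to sum to 1. Each output
    pixel is the affinity-weighted average of the input scores.
    """
    height, width = len(cross), len(cross[0])
    flat = [value for row in cross for value in row]
    refined = [sum(a * v for a, v in zip(affinity_row, flat))
               for affinity_row in self_attn]
    return [refined[r * width:(r + 1) * width] for r in range(height)]


# Toy 2x2 map: half of the attention mass sits on the top-left pixel.
demo_attn = [[0.5, 0.25], [0.125, 0.125]]
print(inside_box_loss(demo_attn, (0, 0, 1, 1)))  # 0.5
```

In loss-guided methods of this general kind, such a loss is differentiated with respect to the latent at each denoising step, and the latent is nudged by a gradient step of the general form $z_t \leftarrow z_t - \eta \nabla_{z_t} \mathcal{L}$, where the step size $\eta$ is a hypothetical hyperparameter. A real implementation would use an autodiff framework and the model's own attention tensors rather than Python lists.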

Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Spatial-Aware Latent Initialization for Controllable Image Generation

LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

Zero-shot spatial layout conditioning for text-to-image diffusion models

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

SpotActor: Training-Free Layout-Controlled Consistent Image Generation

Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis

Layout Control and Semantic Guidance with Attention Loss Backward for T2I Diffusion Model

Enhancing Image Layout Control with Loss-Guided Diffusion Models

Training-Free Layout Control with Cross-Attention Guidance

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Diffusion Self-Distillation for Zero-Shot Customized Image Generation

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis