Abstract: Controllable image generation has long been a core demand in image generation: the goal is to produce images that are creative and coherent while satisfying additional specified conditions. In the post-AIGC era, controllable generation relies on diffusion models and is achieved either by preserving certain components or by intervening in the inference process. This paper addresses two key challenges in controllable generation: (1) mismatched object attributes and poor prompt following; (2) incomplete layout control. We propose a training-free method that backpropagates an attention loss to steer the cross-attention map. Because external conditions such as prompts map naturally onto the attention map, we can control image generation without any training or fine-tuning. The method mitigates attribute mismatch and poor prompt following while introducing explicit layout constraints for controllable image generation. Our approach has performed well in production applications, and we hope it can serve as an inspiring technical report in this field.

What problem does this paper attempt to address?

This paper targets two core challenges in controllable image generation:

1. **Attribute Mismatch and Poor Prompt Following**
   - During generation, the model may bind attributes to the wrong objects or fail to bind them at all, so the generated image does not semantically match the prompt.
   - Formula: if a subject token \( s \) is neglected at the current time step, the loss to optimize is defined as
     \[ L = \max_{s} L_s \quad \text{where} \quad L_s = 1 - \max(\text{AttentionMap}_s) \]
     Here, \( L_s \) measures the gap between the maximum value in the attention map of subject token \( s \) and the ideal value of 1, and the outer maximum is taken over the subject tokens.
2. **Insufficient Layout Control**
   - A text prompt alone provides no explicit layout input, so the model fails to capture some spatial relationships and the generated image cannot meet specific layout requirements.
   - Formula: to measure how strongly the cross-attention of a given token is concentrated inside a specified box \( B \), an energy function is defined:
     \[ E(A, B, i)=\left(1-\frac{\sum_{p \in B} A_{p,i}}{\sum_{p} A_{p,i}}\right)^2 \]
     where \( A_{p,i} \) is the cross-attention value of the \( i \)-th token at spatial location \( p \). Minimizing this function increases the cross-attention of the \( i \)-th token inside the region \( B \), which guides the image toward the specified layout.

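
The two losses above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: each attention map is a nested list of non-negative floats (rows by columns), the box is given as hypothetical `(top, left, bottom, right)` indices with an exclusive end, and the names `semantic_loss` and `layout_energy` are invented here.

```python
from typing import Dict, List, Tuple

# Assumption: one 2D cross-attention map (rows x cols) per token.
AttentionMap = List[List[float]]


def semantic_loss(attn_maps: Dict[str, AttentionMap]) -> float:
    """L = max_s L_s with L_s = 1 - max(attention map of subject token s)."""
    per_token = [
        1.0 - max(max(row) for row in amap)
        for amap in attn_maps.values()
    ]
    # The most neglected subject token dominates the loss.
    return max(per_token)


def layout_energy(amap: AttentionMap, box: Tuple[int, int, int, int]) -> float:
    """E = (1 - sum_{p in B} A_p / sum_p A_p)^2 for a single token's map."""
    top, left, bottom, right = box  # hypothetical box format, end-exclusive
    total = sum(sum(row) for row in amap)
    if total == 0.0:
        # No attention mass anywhere: treat as fully outside the box.
        return 1.0
    inside = sum(sum(row[left:right]) for row in amap[top:bottom])
    return (1.0 - inside / total) ** 2


# Toy 2x2 maps for two subject tokens (values are made up).
maps = {
    "cat": [[0.1, 0.8], [0.05, 0.05]],
    "dog": [[0.1, 0.2], [0.3, 0.1]],
}
print("semantic loss:", semantic_loss(maps))

# Energy of a toy map with respect to a box covering the bottom-left cell.
print("layout energy:", layout_energy([[0.0, 0.0], [0.75, 0.25]], (1, 0, 2, 1)))
```

In this toy example the "dog" token has the weaker peak attention, so it determines the semantic loss. The layout energy falls to zero only when all of the token's attention mass lies inside the box.
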
### Solutions

To address these problems, the paper proposes a training-free method based on backpropagating an attention loss. It has two parts:

- **Semantic Guidance**
  - Use cross-attention map information to adjust the intermediate latent variables during denoising, strengthening the correspondence between the text prompt and the activation values in the attention map, and thereby guiding the model to generate all the described subjects.
  - Refine the generation result by updating the latent variable \( z_t \) with the gradient of the loss:
    \[ z_t' = z_t - \alpha_t \nabla_{z_t} L \]
    where \( \alpha_t \) is the gradient step size.
- **Layout Control**
  - Explicitly introduce layout information and sample from an additional controlled distribution to guide the layout during generation.
  - Select the text tokens that correspond to the user-specified layout and adjust the spatial layout of the generated image through their cross-attention.

Together, these methods aim to improve both attribute binding and layout control, enhancing controllable image generation without training or fine-tuning the model.
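
The latent update rule above can be illustrated with a toy example in plain Python. This is only a sketch under loud assumptions: the "attention map" is a hypothetical stand-in (a sigmoid of each latent entry) rather than a real cross-attention layer, the gradient is approximated with central finite differences instead of automatic differentiation, and the names `toy_attention`, `neglect_loss`, `numerical_grad`, and `update_latent` are invented for illustration.

```python
import math
from typing import Callable, List


def toy_attention(z: List[float]) -> List[float]:
    """Hypothetical stand-in for a token's attention map: sigmoid of each latent entry."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def neglect_loss(z: List[float]) -> float:
    """L_s = 1 - max(attention map), computed on the toy attention."""
    return 1.0 - max(toy_attention(z))


def numerical_grad(loss: Callable[[List[float]], float],
                   z: List[float], eps: float = 1e-5) -> List[float]:
    """Central finite-difference approximation of the gradient of loss at z."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((loss(up) - loss(down)) / (2.0 * eps))
    return grad


def update_latent(z: List[float],
                  loss: Callable[[List[float]], float],
                  alpha: float) -> List[float]:
    """One guidance step: z' = z - alpha * grad_z L."""
    grad = numerical_grad(loss, z)
    return [zi - alpha * gi for zi, gi in zip(z, grad)]


# A few guidance steps with a made-up decreasing step-size schedule alpha_t.
z = [0.0, 0.5, -1.0]
print("initial loss:", neglect_loss(z))
for alpha_t in [1.0, 0.5, 0.25]:
    z = update_latent(z, neglect_loss, alpha_t)
    print("alpha_t =", alpha_t, "loss =", neglect_loss(z))
```

In a real diffusion pipeline the loss would be computed from the denoiser's cross-attention maps and differentiated with respect to the latent by an autodiff framework. The finite-difference gradient here only keeps the sketch dependency-free.
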

Layout Control and Semantic Guidance with Attention Loss Backward for T2I Diffusion Model

Training-Free Layout Control with Cross-Attention Guidance

Enhancing Image Layout Control with Loss-Guided Diffusion Models

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Directed Diffusion: Direct Control of Object Placement through Attention Guidance

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

Spatial-Aware Latent Initialization for Controllable Image Generation

Semantic Guidance Tuning for Text-To-Image Diffusion Models

Obtaining Favorable Layouts for Multiple Object Generation

Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Controllable Generation with Text-to-Image Diffusion Models: A Survey

HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation

Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation

LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation