Layout Control and Semantic Guidance with Attention Loss Backward for T2I Diffusion Model

Guandong Li
2024-11-11
Abstract: Controllable image generation has always been one of the core demands in image generation, aiming to create images that are both creative and logical while satisfying additional specified conditions. In the post-AIGC era, controllable generation relies on diffusion models and is accomplished by maintaining certain components or introducing inference interferences. This paper addresses key challenges in controllable generation: 1. mismatched object attributes during generation and poor prompt-following effects; 2. inadequate completion of controllable layouts. We propose a train-free method based on attention loss backward, cleverly controlling the cross attention map. By utilizing external conditions such as prompts that can reasonably map onto the attention map, we can control image generation without any training or fine-tuning. This method addresses issues like attribute mismatch and poor prompt-following while introducing explicit layout constraints for controllable image generation. Our approach has achieved excellent practical applications in production, and we hope it can serve as an inspiring technical report in this field.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets two core challenges in controllable image generation:

1. **Attribute Mismatch and Poor Prompt Following**:
   - During generation, the model may bind attributes to the wrong objects, or fail to bind them at all. The generated image then does not match the semantics of the prompt.
   - Formula: if a subject token \( s \in S \) is ignored at the current time step, the loss to optimize is defined as
     \[ L = \max_{s \in S} L_s \quad \text{where} \quad L_s = 1 - \max(\text{AttentionMap}_s) \]
     Here, \( L_s \) measures the gap between the maximum value in the attention map of subject token \( s \) and the ideal value of 1.
2. **Insufficient Layout Control**:
   - A text prompt alone provides no explicit layout input, so the model cannot capture certain spatial relationships and the generated image fails to meet specific spatial layout requirements.
   - Formula: to measure how strongly the cross-attention of a specified token is concentrated inside a specified box \( B \), an energy function is defined:
     \[ E(A, B, i) = \left(1 - \frac{\sum_{p \in B} A_{p,i}}{\sum_{p} A_{p,i}}\right)^2 \]
     where \( A_{p,i} \) is the attention-map value of the \( i \)-th token at spatial position \( p \). Minimizing this function increases the cross-attention of the \( i \)-th token inside region \( B \), which guides the image toward the specified layout.
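The two formulas above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `subject_loss` and `layout_energy`, the `(top, left, bottom, right)` half-open box convention, and the toy attention maps are all assumptions made for illustration.

```python
import numpy as np


def subject_loss(attn_maps):
    """Return (L, per-subject L_s) for maps of shape (num_subjects, H, W)."""
    flat = attn_maps.reshape(attn_maps.shape[0], -1)
    per_subject = 1.0 - flat.max(axis=1)   # L_s = 1 - max(attention map of s)
    return per_subject.max(), per_subject  # L = max over subjects of L_s


def layout_energy(attn_map, box):
    """E = (1 - attention mass inside the box / total attention mass)^2."""
    top, left, bottom, right = box         # half-open box (assumed convention)
    inside = attn_map[top:bottom, left:right].sum()
    return (1.0 - inside / attn_map.sum()) ** 2


# Toy data (made up): subject 0 is well attended, subject 1 is neglected.
maps = np.full((2, 4, 4), 0.05)
maps[0, 1, 1] = 0.9
maps[1, 2, 3] = 0.2
loss, per_subject = subject_loss(maps)

# A uniform map spreads attention everywhere; a focused map stays in the box.
uniform = np.ones((4, 4))
energy_uniform = layout_energy(uniform, (0, 0, 2, 2))
focused = np.zeros((4, 4))
focused[0:2, 0:2] = 1.0
energy_focused = layout_energy(focused, (0, 0, 2, 2))
```

In this toy data the neglected subject determines the loss (\( L = 0.8 \)). The uniform map puts a quarter of its mass inside the box, giving \( E = (1 - 0.25)^2 = 0.5625 \), while the map that lies entirely inside the box gives \( E = 0 \).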
### Solutions

To solve these problems, the paper proposes a training-free method based on backpropagating an attention loss. It has two components:

- **Semantic Guidance**:
  - Use cross-attention map information to adjust the intermediate latent variables during denoising. This strengthens the mapping between the text prompt and the activations in the attention map, so that the model generates every subject described in the prompt.
  - Update the latent variable \( Z_t \) with the gradient of the loss:
    \[ Z_t' = Z_t - \alpha_t \nabla_{Z_t} L \]
    where \( \alpha_t \) is the gradient step size.
- **Layout Control**:
  - Introduce layout information explicitly and sample from an additional controlled distribution to guide the layout during generation.
  - Select the text tokens that correspond to the user-specified layout, and adjust the spatial layout of the generated image through their cross-attention.

Together, these methods aim to improve attribute binding and layout control at the same time, without training or fine-tuning the model.
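The latent update \( Z_t' = Z_t - \alpha_t \nabla_{Z_t} L \) can be sketched on a toy problem. Everything below is an assumption for illustration, not the paper's code: `toy_attention` replaces the real cross-attention of a diffusion model with a softmax over the latent itself, `numerical_grad` uses finite differences where a real system would backpropagate through the denoising network, and the step size and step count are arbitrary.

```python
import numpy as np


def toy_attention(z):
    """Hypothetical stand-in for a cross-attention map: softmax over positions."""
    e = np.exp(z - z.max())
    return e / e.sum()


def layout_energy(attn_map, box):
    """E = (1 - attention mass inside the box / total attention mass)^2."""
    top, left, bottom, right = box  # half-open box (assumed convention)
    inside = attn_map[top:bottom, left:right].sum()
    return (1.0 - inside / attn_map.sum()) ** 2


def numerical_grad(loss_fn, z, eps=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = np.zeros_like(z)
    for idx in np.ndindex(z.shape):
        step = np.zeros_like(z)
        step[idx] = eps
        grad[idx] = (loss_fn(z + step) - loss_fn(z - step)) / (2 * eps)
    return grad


def guided_update(z, loss_fn, alpha):
    """One step of Z_t' = Z_t - alpha_t * grad_{Z_t} L."""
    return z - alpha * numerical_grad(loss_fn, z)


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 8))  # toy latent, made up for the example
box = (1, 2, 4, 6)           # target region B for the chosen token


def loss_fn(latent):
    return layout_energy(toy_attention(latent), box)


energy_before = loss_fn(z)
for _ in range(50):
    z = guided_update(z, loss_fn, alpha=0.5)
energy_after = loss_fn(z)
print(f"energy before: {energy_before:.4f}, after: {energy_after:.4f}")
```

Each update moves the latent so that more of the toy attention mass falls inside the box, which lowers the energy. In an actual diffusion pipeline the same update would be applied to the latent at selected denoising steps, with the attention maps read from the model's cross-attention layers.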