LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou
2024-03-26
Abstract: Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in precise control of image compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, we introduce a Localized Attention Constraint (LAC), leveraging semantic affinity between pixels in self-attention maps to create precise representations of desired objects and effectively ensure the accurate placement of objects in designated regions. We further propose a Padding Token Constraint (PTC) to leverage the semantic information embedded in previously neglected padding tokens, improving the consistency between object appearance and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods. Extensive experiments showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses a limitation of text-to-image (T2I) diffusion models: they generate high-quality images, but they rely on text prompts alone for spatial layout control. Controlling object positions solely through text is neither precise nor efficient in complex scenes. In movie poster design, for example, the spatial relationships among multiple objects and their attributes are difficult to specify accurately with a simple text prompt.

To solve this problem, researchers have explored layout-to-image synthesis (LIS) methods, which let users specify object positions through layout instructions such as bounding boxes, semantic masks, or doodles. However, existing fully supervised LIS methods require large amounts of paired layout-image training data, which is expensive and difficult to obtain in practice, and training or fine-tuning these models consumes significant computational resources.

To address these issues, the paper proposes LoCo, a training-free layout-to-image synthesis method that generates high-quality images conforming to both text prompts and layout instructions without additional training. LoCo introduces two constraints: the Localized Attention Constraint (LAC) and the Padding Token Constraint (PTC). LAC uses the semantic affinity between pixels in self-attention maps to place objects accurately in their designated regions, while PTC uses the semantic information in previously neglected padding tokens to improve the consistency between object appearance and layout instructions.

According to the paper, LoCo outperforms existing training-free layout-to-image synthesis methods on multiple benchmarks, both quantitatively and qualitatively. It can also be integrated as a plugin into fully supervised LIS methods to further enhance their performance.
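To make the idea of a layout constraint concrete, here is a minimal toy sketch in Python. It is not LoCo's LAC or PTC formulation, which this summary does not spell out. It assumes a single 2D attention map for one object token and one bounding box, and it penalizes attention mass that falls outside the box. The names `box_mask` and `localization_loss` and the box convention are hypothetical and chosen only for illustration. Training-free methods of this kind typically use the gradient of such a loss to nudge the latent during denoising; that step is omitted here.

```python
from typing import List, Tuple

# Assumed box convention: (top, left, bottom, right), bottom/right exclusive.
Box = Tuple[int, int, int, int]


def box_mask(height: int, width: int, box: Box) -> List[List[float]]:
    """Binary mask that is 1.0 inside the bounding box and 0.0 elsewhere."""
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0
         for c in range(width)]
        for r in range(height)
    ]


def localization_loss(attn: List[List[float]], box: Box) -> float:
    """Toy constraint: (1 - attention mass inside box / total mass) ** 2.

    Returns 0.0 when all attention lies inside the box and 1.0 when none does.
    """
    height, width = len(attn), len(attn[0])
    mask = box_mask(height, width, box)
    total = sum(sum(row) for row in attn)
    if total == 0.0:
        # No attention anywhere: treat as maximally violating the layout.
        return 1.0
    inside = sum(
        attn[r][c] * mask[r][c]
        for r in range(height)
        for c in range(width)
    )
    return (1.0 - inside / total) ** 2


# A 4x4 attention map whose mass sits in the top-left 2x2 corner.
attn_map = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
loss_aligned = localization_loss(attn_map, (0, 0, 2, 2))     # box covers the mass
loss_misplaced = localization_loss(attn_map, (2, 2, 4, 4))   # box misses the mass
```

In this toy setup the loss is low when the object's attention already sits in its designated region and high when it does not, which is the kind of signal a layout constraint needs to provide.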