LocRef-Diffusion: Tuning-Free Layout and Appearance-Guided Generation

Fan Deng, Yaguang Wu, Xinyang Yu, Xiangjun Huang, Jian Yang, Guangyu Yan, Qiang Xu
2024-11-22
Abstract: Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses personalized, controllable generation of the positions and appearances of multiple instances in text-to-image models. Diffusion-based text-to-image techniques already generate high-quality images, but multi-instance generation still faces three challenges:

1. **Insufficient positioning accuracy**: Generated objects are not placed precisely at the specified positions.
2. **Low fidelity to reference images**: Generated objects bear little resemblance to their reference images.
3. **Feature leakage**: When multiple instances are generated, features of different instances become confused, degrading the results.

To solve these problems, the paper proposes the LocRef-Diffusion model, which achieves high-fidelity control over the position and appearance of multiple instances without fine-tuning by introducing two key components, Layout-net and Appearance-net.

### Specific problem description

1. **Insufficient positioning accuracy**:
   - When existing methods generate multiple objects, they often fail to place them at the specified positions, leaving large position deviations in the generated images.
   - For example, generated objects may fall outside their specified bounding boxes, or their sizes may not match the boxes.
2. **Low fidelity to reference images**:
   - When reference images guide generation, existing methods struggle to keep generated objects similar to those references, and fidelity drops further as the number of objects grows.
   - This can produce large differences in shape, color, or texture between the generated objects and the reference images.
3. **Feature leakage**:
   - When handling multiple instances, existing methods are prone to feature leakage: features of different instances interfere with each other, making the results unclear or inaccurate.
   - For example, the features of one instance may influence the generation of another, so the generated objects blend into each other.

### Solutions of LocRef-Diffusion

1. **Layout-net**:
   - It introduces explicit layout information and a region-aware cross-attention module to control where objects are generated.
   - By integrating the explicit layout information into the diffusion process, Layout-net ensures that generated objects land at the specified positions.
2. **Appearance-net**:
   - It extracts the appearance features of instances from reference images and integrates them into the diffusion model through a cross-attention mechanism.
   - It extracts foreground features while suppressing background noise, which increases the similarity between the generated objects and the reference images.

With these components, LocRef-Diffusion achieves higher positioning accuracy and better reference-image fidelity in multi-instance generation while avoiding feature leakage. Experimental results show that the model performs better than existing methods on the COCO and OpenImages datasets.
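
To make the idea of region-restricted cross-attention more concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes each instance comes with a normalized bounding box and a small set of reference tokens, and it lets every latent cell attend only to the tokens of instances whose box covers that cell. The function names `box_mask` and `region_cross_attention`, the use of the tokens as both keys and values, and the absence of learned projections are all simplifying assumptions made for illustration.

```python
import numpy as np


def box_mask(box, height, width):
    """Boolean (height*width,) mask of latent cells whose centre lies in `box`.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] coordinates (an assumption).
    """
    x0, y0, x1, y1 = box
    ys = (np.arange(height) + 0.5) / height   # cell-centre y coordinates
    xs = (np.arange(width) + 0.5) / width     # cell-centre x coordinates
    inside_y = (ys >= y0) & (ys < y1)
    inside_x = (xs >= x0) & (xs < x1)
    return (inside_y[:, None] & inside_x[None, :]).reshape(-1)


def region_cross_attention(queries, instance_tokens, boxes, height, width):
    """Masked cross-attention over per-instance reference tokens.

    queries:          (height*width, d) latent features
    instance_tokens:  list of (n_i, d) arrays, one per instance
    boxes:            list of (x0, y0, x1, y1), one per instance
    Returns (height*width, d); cells outside every box receive zeros.
    Tokens serve as both keys and values (no learned projections here).
    """
    d = queries.shape[1]
    keys = np.concatenate(instance_tokens, axis=0)            # (N, d)
    # allowed[q, k] is True when key k belongs to an instance covering cell q.
    allowed = np.concatenate(
        [np.repeat(box_mask(b, height, width)[:, None], len(t), axis=1)
         for b, t in zip(boxes, instance_tokens)],
        axis=1,
    )                                                          # (HW, N)
    scores = queries @ keys.T / np.sqrt(d)
    scores = np.where(allowed, scores, -np.inf)                # block other instances
    covered = allowed.any(axis=1)
    out = np.zeros_like(queries, dtype=float)
    if covered.any():
        s = scores[covered]
        s = s - s.max(axis=1, keepdims=True)                   # stable softmax
        w = np.exp(s)
        w /= w.sum(axis=1, keepdims=True)
        out[covered] = w @ keys
    return out
```

In this sketch, a cell inside one instance's box can never receive features from another instance's reference tokens unless the boxes overlap, which is one simple way to picture how masking limits feature leakage. A real model would add the result to the latent features as a residual and use learned query, key, and value projections.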