Abstract: Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to achieve personalized, controllable generation of the positions and appearances of multiple instances in text-to-image generation models. Although existing diffusion-based text-to-image techniques can already generate high-quality images, the following challenges remain in multi-instance generation:

1. **Insufficient positioning accuracy**: The generated objects cannot be precisely placed at the specified positions.
2. **Low fidelity to reference images**: The similarity between the generated objects and the reference images is low.
3. **Feature leakage problem**: When generating multiple instances, the features of different instances get mixed up, resulting in unsatisfactory generation results.

To solve these problems, the paper proposes the LocRef-Diffusion model. The model achieves high-fidelity control over the position and appearance of multiple instances without fine-tuning by introducing two key components, Layout-net and Appearance-net.

### Specific problem description

1. **Insufficient positioning accuracy**:
   - When existing methods generate multiple objects, they often cannot place the objects accurately at the specified positions, resulting in large position deviations in the generated images.
   - For example, the generated objects may deviate from the specified bounding boxes, or their sizes may be inconsistent with them.
2. **Low fidelity to reference images**:
   - When reference images guide generation, existing methods have difficulty ensuring high similarity between the generated objects and the reference images. Fidelity decreases further when multiple objects are generated.
   - This may lead to large differences in shape, color, or texture between the generated objects and the reference images.
3. **Feature leakage problem**:
   - When dealing with multiple instances, existing methods are prone to feature leakage: the features of different instances interfere with each other, resulting in unclear or inaccurate generation results.
   - For example, the features of one instance may affect the generation of another instance, resulting in confusion between the generated objects.

### Solutions of LocRef-Diffusion

1. **Layout-net**:
   - It introduces explicit layout information and a region-aware cross-attention module to precisely control where objects are generated.
   - By integrating explicit layout information into the diffusion process, Layout-net ensures that the generated objects are accurately placed at the specified positions.
2. **Appearance-net**:
   - Appearance-net extracts the appearance features of instances from reference images and integrates them into the diffusion model through a cross-attention mechanism.
   - Appearance-net can effectively extract foreground features while suppressing background noise, thereby increasing the similarity between the generated objects and the reference images.

Through these improvements, the LocRef-Diffusion model can achieve higher positioning accuracy and better reference image fidelity in multi-instance generation tasks while avoiding the feature leakage problem. Experimental results show that the model performs better than existing methods on the COCO and OpenImages datasets.
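
To make the idea of region-restricted cross-attention more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`box_to_mask`, `region_cross_attention`), the normalized `(x0, y0, x1, y1)` box format, and the use of raw token vectors as both keys and values (no learned projections) are all assumptions made for illustration. It only shows how limiting each latent position to the tokens of the instance whose box covers it keeps one instance's features from leaking into another's region.

```python
import math


def box_to_mask(box, height, width):
    """Flat 0/1 mask (row-major) of latent cells whose centre lies inside a
    normalized box (x0, y0, x1, y1). The box format is an assumption."""
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        for col in range(width):
            cx, cy = (col + 0.5) / width, (row + 0.5) / height
            mask.append(1 if (x0 <= cx < x1 and y0 <= cy < y1) else 0)
    return mask


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_cross_attention(queries, instance_tokens, masks):
    """Each latent position attends only to tokens of instances whose mask
    covers it. Positions covered by no box receive a zero update.

    queries:         N vectors, one per latent position
    instance_tokens: K lists of token vectors, one list per instance
    masks:           K flat masks of length N
    """
    dim = len(queries[0])
    output = []
    for pos, query in enumerate(queries):
        # Collect only the tokens that are allowed at this position.
        keys = [tok
                for k, tokens in enumerate(instance_tokens)
                if masks[k][pos]
                for tok in tokens]
        if not keys:
            output.append([0.0] * dim)
            continue
        scores = [sum(q * t for q, t in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        weights = softmax(scores)
        # Keys double as values in this simplified sketch.
        output.append([sum(w * key[d] for w, key in zip(weights, keys))
                       for d in range(dim)])
    return output


# Toy example: a 2x2 latent grid with one instance per half of the image.
left_mask = box_to_mask((0.0, 0.0, 0.5, 1.0), 2, 2)
right_mask = box_to_mask((0.5, 0.0, 1.0, 1.0), 2, 2)
queries = [[1.0, 1.0]] * 4
tokens = [[[1.0, 0.0]], [[0.0, 1.0]]]  # one toy token per instance
result = region_cross_attention(queries, tokens, [left_mask, right_mask])
print(result)
```

In the toy example, positions in the left column receive only the first instance's token and positions in the right column only the second's, so the two appearance features never mix. A real model would apply learned query, key, and value projections and would need a policy for overlapping boxes, which this sketch handles simply by attending over all covering instances.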

LocRef-Diffusion: Tuning-Free Layout and Appearance-Guided Generation

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation

DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Spatial-Aware Latent Initialization for Controllable Image Generation

Obtaining Favorable Layouts for Multiple Object Generation

LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models

InstanceDiffusion: Instance-level Control for Image Generation

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

Layout Agnostic Scene Text Image Synthesis with Diffusion Models

Diffusion Cocktail: Mixing Domain-Specific Diffusion Models for Diversified Image Generations

Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation

Continuous Layout Editing of Single Images with Diffusion Models

RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models