Abstract: Diffusion models have achieved unprecedented fidelity and diversity in synthesizing images, videos, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly when synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our Self-Cross guidance is more effective in eliminating subject mixing. What's more, our guidance addresses mixing for all relevant patches of a subject beyond the most discriminant one, e.g., the beak of a bird. We aggregate self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model such as Stable Diffusion. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.

What problem does this paper attempt to address?

This paper addresses the subject mixing problem in diffusion-based text-to-image generation, especially when synthesizing multiple subjects with similar appearances. Specifically:

1. **Subject mixing problem**: When a text prompt asks for multiple objects with similar appearances, existing methods often blend features across objects, so the generated image is unfaithful to the prompt. For example, a prompt for "leopards and tigers" or "squirrels and mice" may produce animals that mix features of both.
2. **Limitations of existing methods**: Many methods try to improve the consistency and accuracy of image synthesis through self-attention or cross-attention alone, but they do not fully eliminate subject mixing, especially for objects with similar appearances.
3. **Improve synthesis quality**: The authors propose a new method, Self-Cross diffusion guidance, to reduce subject mixing and improve the quality and consistency of the generated images. Besides reducing subject mixing, the method also reduces subject neglect and makes the generated objects more recognizable and faithful to the prompt.

### Main contributions of the paper

1. **Propose Self-Cross guidance**: Self-Cross guidance targets subject mixing between similar-looking objects in image synthesis. It reduces feature overlap between different objects by penalizing the overlap between one subject's aggregated self-attention map and the other subjects' cross-attention maps.
2. **Training-free optimization method**: Self-Cross guidance requires no additional training and can be applied directly to pre-trained diffusion models (such as Stable Diffusion), improving the performance of existing models.
3. **Release a new benchmark dataset**: To promote research on image synthesis of similar objects, the authors release a more challenging benchmark, the Similar Subjects Dataset (SSD), and use an advanced vision-language model (GPT-4o) for evaluation.
4. **Reduce subject neglect**: As a side effect, the method also reduces subject neglect and improves the presence and recognizability of the generated objects.

Through these contributions, the paper offers a training-free way to improve text-to-image synthesis, particularly for prompts involving objects with similar appearances.
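The overlap penalty described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the top-k rule for selecting patches, the cross-attention weighting used for aggregation, and the element-wise-minimum overlap measure are all assumptions chosen to keep the example short. It only shows the general idea of comparing one subject's aggregated self-attention region with another subject's cross-attention map.

```python
import numpy as np

def aggregate_self_attention(self_attn, cross_attn_map, top_k):
    """Aggregate self-attention rows of the patches most associated with a subject.

    self_attn:      (N, N) array; row i is how patch i attends to all N patches.
    cross_attn_map: (N,) array; cross-attention of one subject token over patches.
    top_k:          number of patches to select (hypothetical selection rule).
    """
    idx = np.argsort(cross_attn_map)[-top_k:]      # patches with highest cross-attention
    weights = cross_attn_map[idx]
    weights = weights / weights.sum()              # normalize selection weights
    return weights @ self_attn[idx]                # (N,) region the subject attends to

def self_cross_overlap(self_attn, cross_attn, top_k=2):
    """Sum, over ordered subject pairs (i, j) with i != j, of the overlap between
    subject i's aggregated self-attention and subject j's cross-attention.

    cross_attn: (S, N) array, one cross-attention map per subject token.
    Overlap is measured with an element-wise minimum (an assumption).
    """
    num_subjects = cross_attn.shape[0]
    total = 0.0
    for i in range(num_subjects):
        region_i = aggregate_self_attention(self_attn, cross_attn[i], top_k)
        for j in range(num_subjects):
            if i == j:
                continue
            cross_j = cross_attn[j] / cross_attn[j].sum()
            total += np.minimum(region_i, cross_j).sum()
    return float(total)

# Toy image with 4 patches and 2 subjects.
# Well-separated case: each subject occupies its own pair of patches.
separated_self = np.array([[0.5, 0.5, 0.0, 0.0],
                           [0.5, 0.5, 0.0, 0.0],
                           [0.0, 0.0, 0.5, 0.5],
                           [0.0, 0.0, 0.5, 0.5]])
separated_cross = np.array([[0.5, 0.5, 0.0, 0.0],
                            [0.0, 0.0, 0.5, 0.5]])

# Fully mixed case: every patch attends everywhere, both subjects cover all patches.
mixed_self = np.full((4, 4), 0.25)
mixed_cross = np.full((2, 4), 0.25)

print(self_cross_overlap(separated_self, separated_cross))
print(self_cross_overlap(mixed_self, mixed_cross))
```

In a training-free guidance setting, a penalty of this kind would typically be evaluated on the attention maps at a denoising step and its gradient used to adjust the latent before the next step. That update loop is omitted here because it depends on the specific diffusion model.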

Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
