Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

Weimin Qiu, Jieke Wang, Meng Tang
2024-11-28
Abstract: Diffusion models have achieved unprecedented fidelity and diversity for synthesizing images, videos, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our Self-Cross guidance is more effective in eliminating subject mixing. Moreover, our guidance addresses mixing for all relevant patches of a subject beyond the most discriminant one, e.g., the beak of a bird. We aggregate self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model such as Stable Diffusion. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the subject mixing problem in text-to-image generation with diffusion models, especially when synthesizing multiple subjects with similar appearances. Specifically:

1. **Subject mixing problem**: When a text prompt asks for several similar-looking objects, existing methods often blend features of the different objects, so the generated image is not faithful to the prompt. For example, prompts such as "leopards and tigers" or "squirrels and mice" may yield animals that mix features of both.
2. **Limitations of existing methods**: Many methods try to improve the consistency and accuracy of image synthesis through self-attention or cross-attention guidance, but they do not completely eliminate subject mixing, especially for objects with similar appearances.
3. **Improve synthesis quality**: The authors propose a new method, Self-Cross diffusion guidance, to reduce subject mixing and improve the quality and consistency of the generated images. Besides reducing subject mixing, the method also reduces subject neglect and improves the identifiability and fidelity of the generated objects.

### Main contributions of the paper

1. **Propose Self-Cross guidance**: Self-Cross guidance targets subject mixing between similar-looking objects by penalizing the overlap between the cross-attention map of one subject and the aggregated self-attention map of another, which reduces feature overlap between the objects.
2. **Training-free optimization method**: Self-Cross guidance requires no additional training and can be applied directly to pre-trained diffusion models (such as Stable Diffusion), improving the performance of existing models.
3. **Release a new benchmark dataset**: To promote research on image synthesis of similar objects, the authors release a more challenging benchmark, the Similar Subjects Dataset (SSD), and use an advanced vision-language model (GPT-4o) for evaluation.
4. **Reduce subject neglect**: As a side effect, the method also reduces subject neglect and improves the presence and identifiability of the generated objects.

Together, these contributions give a training-free way to improve text-to-image synthesis for prompts that contain similar-looking subjects.
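The overlap penalty described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the top-k patch selection, the cross-attention weighting, the `min`-based overlap measure, and the names `aggregate_self_attention` and `self_cross_overlap` are all assumptions made for illustration. In guidance methods of this kind, the gradient of such a loss with respect to the latent is typically used to nudge the latent during sampling; that step is omitted here.

```python
import numpy as np


def aggregate_self_attention(cross_map, self_attn, top_k=4):
    """Aggregate the self-attention maps of a subject's strongest patches.

    cross_map: (N,) non-negative cross-attention map of one subject token.
    self_attn: (N, N) matrix whose row p is the self-attention map of patch p.
    top_k:     number of patches to select (hypothetical selection rule).
    """
    idx = np.argsort(cross_map)[-top_k:]            # patches with highest cross-attention
    weights = cross_map[idx] / cross_map[idx].sum() # weight rows by cross-attention
    agg = weights @ self_attn[idx]                  # weighted sum of selected rows, shape (N,)
    return agg / agg.sum()                          # normalize to sum to 1


def self_cross_overlap(cross_maps, self_attn, top_k=4):
    """Sum, over ordered subject pairs (i, j) with i != j, of the overlap between
    subject i's aggregated self-attention and subject j's cross-attention."""
    normed = [c / c.sum() for c in cross_maps]
    aggs = [aggregate_self_attention(c, self_attn, top_k) for c in normed]
    loss = 0.0
    for i, agg in enumerate(aggs):
        for j, cross in enumerate(normed):
            if i != j:
                # Element-wise minimum is one simple overlap measure (an assumption).
                loss += np.minimum(agg, cross).sum()
    return float(loss)


# Toy example with 8 patches: subject A occupies patches 0-3, subject B patches 4-7.
cross_a = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
cross_b = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Separated case: each patch attends only to patches of its own subject.
separated = np.zeros((8, 8))
separated[:4, :4] = 0.25
separated[4:, 4:] = 0.25

# Mixed case: every patch attends uniformly to all patches.
mixed = np.full((8, 8), 1.0 / 8.0)

print("separated:", self_cross_overlap([cross_a, cross_b], separated))
print("mixed:    ", self_cross_overlap([cross_a, cross_b], mixed))
```

In this toy setup the penalty is zero when each subject's patches attend only to their own region and grows when self-attention spreads across both subjects, which is the behavior a guidance loss against subject mixing needs.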