Abstract: Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. Our code will be available at https://github.com/wtybest/EnMMDiT.
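
For intuition, test-time latent optimization of this kind is commonly written as a gradient step on the noisy latent, applied before each of the early denoising steps. The form below is only a sketch under stated assumptions, not the paper's exact formulation: the weights $\lambda_i$, the step size $\alpha_t$, the loss subscripts, and the weighted-sum combination of the three losses are all introduced here for illustration.

```latex
\mathcal{L}(z_t) = \lambda_1 \mathcal{L}_{\text{block}}(z_t)
                 + \lambda_2 \mathcal{L}_{\text{enc}}(z_t)
                 + \lambda_3 \mathcal{L}_{\text{overlap}}(z_t),
\qquad
z_t \leftarrow z_t - \alpha_t \, \nabla_{z_t} \mathcal{L}(z_t)
\quad \text{(early denoising steps only)}
```

Here $z_t$ is the latent at denoising step $t$; the model weights stay frozen and only the latent is updated.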

What problem does this paper attempt to address?

This paper addresses subject neglect and subject mixing in text-to-image generation. When the input prompt contains multiple subjects that are similar in semantics or appearance, existing models based on the Multimodal Diffusion Transformer (MMDiT), such as Stable Diffusion 3, still omit or confuse subjects. Specifically, the authors observe that for prompts containing two or more similar subjects these models perform poorly, and the generation success rate drops significantly (even below 20%).

To understand the root cause, the authors visualize the cross-modal attention part of the joint self-attention layers and analyze the relationship between generated pixels and individual text tokens. This reveals three possible ambiguities in the MMDiT architecture: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. These ambiguities make it difficult for the model to distinguish and represent multiple similar subjects during generation.

To address them, the authors propose three loss functions:

1. **Block Alignment Loss**: A self-refinement mechanism that uses the average subject attention of the later transformer blocks to guide the earlier blocks into alignment, reducing the semantic leakage caused by the earlier blocks.
2. **Text Encoder Alignment Loss**: Since it is difficult to determine which text encoder's activations are more reliable, an implicit constraint encourages the CLIP text encoder and the T5 text encoder to produce consistent responses.
3. **Overlap Loss**: Prevents different subjects from being generated at the same position, thereby reducing semantic ambiguity.

Although these losses significantly improve generation quality, the authors find that semantic ambiguity still persists when generating multiple similar subjects, because the guidance provided by the overlap loss is not explicit enough. They therefore further propose **Overlap Online Detection** and a **Back-to-Start Sampling Strategy**: overlaps are detected early, sampling is restarted from the beginning, and a restricted loss guided by a conflict-region mask is applied so that similar subjects do not overlap. Experiments on a newly constructed, challenging dataset of similar subjects show that the proposed methods significantly improve generation quality and success rate and outperform existing methods.
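
The summary above gives no formulas, so the following is a minimal, dependency-free sketch of how such attention-based losses and an overlap check could look. Everything in it is an assumption made for illustration: the function names, the specific loss forms (shared mass for overlap, mean squared difference for alignment), and the `threshold` and `max_fraction` parameters are hypothetical and are not taken from the paper or its code. Attention maps are represented as flat lists of non-negative floats, one value per image patch.

```python
from typing import List, Sequence

AttnMap = List[float]  # one subject's attention over image patches, flattened


def normalize(attn: Sequence[float]) -> AttnMap:
    """Scale a non-negative attention map so that its entries sum to 1."""
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [a / total for a in attn]


def overlap_loss(attn_a: Sequence[float], attn_b: Sequence[float]) -> float:
    """Assumed overlap loss: mass shared by two normalized subject maps.

    Returns 0.0 when the subjects attend to disjoint patches and
    approximately 1.0 when their maps are identical.
    """
    a, b = normalize(attn_a), normalize(attn_b)
    return sum(min(x, y) for x, y in zip(a, b))


def alignment_loss(attn: Sequence[float], target: Sequence[float]) -> float:
    """Assumed alignment loss: mean squared difference of normalized maps.

    The same form can stand in for block alignment (early block vs. the
    later-block average) and for text encoder alignment (CLIP vs. T5).
    """
    a, t = normalize(attn), normalize(target)
    return sum((x - y) ** 2 for x, y in zip(a, t)) / len(a)


def mean_map(maps: Sequence[Sequence[float]]) -> AttnMap:
    """Element-wise average of several normalized per-block maps."""
    normed = [normalize(m) for m in maps]
    return [sum(column) / len(normed) for column in zip(*normed)]


def conflict_mask(attn_a: Sequence[float], attn_b: Sequence[float],
                  threshold: float = 0.5) -> List[bool]:
    """Mark patches where both subjects reach `threshold` of their own peak."""
    peak_a, peak_b = max(attn_a), max(attn_b)
    return [x >= threshold * peak_a and y >= threshold * peak_b
            for x, y in zip(attn_a, attn_b)]


def overlap_detected(attn_a: Sequence[float], attn_b: Sequence[float],
                     threshold: float = 0.5,
                     max_fraction: float = 0.1) -> bool:
    """Hypothetical online check: too many patches fall in the conflict mask."""
    mask = conflict_mask(attn_a, attn_b, threshold)
    return sum(mask) / len(mask) > max_fraction


if __name__ == "__main__":
    cat = [0.9, 0.8, 0.1, 0.0]  # toy maps over four patches
    dog = [0.0, 0.1, 0.8, 0.9]
    print("overlap loss:", overlap_loss(cat, dog))
    print("restart needed:", overlap_detected(cat, dog))
```

In a real pipeline these quantities would be computed on differentiable tensors so that their gradient with respect to the latent can drive the test-time update. Under the strategy described above, a positive detection would trigger a restart of sampling from the initial step, with the conflict mask restricting where the loss is applied.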

Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation
