Abstract: Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images from textual prompts. Despite these advances, existing models struggle with complex prompts involving multiple objects and attributes, often misaligning modifiers with their corresponding nouns or neglecting certain elements. Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding and a lack of robust generalization guarantees. Leveraging the PAC-Bayes framework, we propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties, including divergence between objects, alignment between modifiers and their corresponding nouns, minimal attention to irrelevant tokens, and regularization for better generalization. Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment. We demonstrate the effectiveness of our method on standard benchmarks, achieving state-of-the-art results across multiple metrics. By integrating custom priors into the denoising process, our method enhances image quality and addresses long-standing challenges in T2I diffusion models, paving the way for more reliable and interpretable generative models.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the difficulties that text-to-image diffusion models face with complex prompts containing multiple objects and attributes. Specifically, existing models have the following deficiencies:

1. **Attribute-object alignment problem**: The model often binds attributes to the wrong nouns, so some objects or attributes are ignored or misrepresented in the generated image. For example, given the prompt "a mouse in a white spacesuit", an existing model may generate only a mouse or only a spacesuit and fail to associate the two.
2. **Black-box problem of the attention mechanism**: The existing attention mechanism lacks transparency, and users cannot directly control how attention is distributed, which leads to poor performance on tasks requiring precise modifier-object associations.
3. **Insufficient generalization ability**: Although existing methods have improved in some respects, they still lack theoretical generalization guarantees and are prone to overfitting.

To address these challenges, the authors propose a Bayesian method based on the PAC-Bayes framework. A custom prior distribution guides the attention mechanism: the method minimizes the Kullback-Leibler (KL) divergence between the learned attention distribution and the custom prior, which encourages better attribute-object alignment and a more balanced attention distribution.

### Main contributions

1. **Introduction of the Bayesian framework**: Allows users to design custom prior distributions, which gives more control over the attention mechanism and addresses its black-box nature.
2. **Formalization of the problem**: Formalizes the problem under the PAC-Bayes framework and provides theoretical generalization guarantees.
3. **Empirical results**: Achieves state-of-the-art results on standard benchmark datasets, verifying the effectiveness of the method in improving attribute binding and attention distribution.

### Formula presentation

1. **PAC-Bayes bound**: \[ \mathbb{E}_{h\sim Q}[\text{Risk}(h)]\leq\mathbb{E}_{h\sim Q}[\hat{\text{Risk}}(h)]+\sqrt{\frac{D_{\text{KL}}(Q\|P)+\ln\left(\frac{2\sqrt{N}}{\delta}\right)}{2N}} \] where \(Q\) is the posterior distribution, \(P\) is the prior distribution, \(\text{Risk}(h)\) is the true risk, \(\hat{\text{Risk}}(h)\) is the empirical risk, \(D_{\text{KL}}(Q\|P)\) is the KL divergence between \(Q\) and \(P\), \(N\) is the number of samples, and \(\delta\) is the confidence parameter (the bound holds with probability at least \(1-\delta\)).
2. **Total loss function**: \[ L_{\text{total}}=\lambda_{\text{div}}L_{\text{div}}+\lambda_{\text{sim}}L_{\text{sim}}+\lambda_{\text{out}}L_{\text{out}}+\lambda_{\text{PAC}}R_{\text{PAC}} \] where \(\lambda_{\text{div}}=\alpha\), \(\lambda_{\text{sim}}=\beta\), \(\lambda_{\text{out}}=\gamma\), \(\lambda_{\text{PAC}}=\eta\).
3. **Divergence loss**: \[ L_{\text{div}} = -\frac{1}{|P|}\sum_{(i, j)\in P}\frac{1}{2}\left[D_{\text{KL}}(A_
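The PAC-Bayes bound and the total loss listed above can be evaluated numerically. The following is a minimal Python sketch, not the authors' implementation: it computes a discrete KL divergence, the square-root complexity term of the bound, and the weighted sum \(L_{\text{total}}\). The function names, the toy attention vectors, the weight values, and the choice to pass \(R_{\text{PAC}}\) in as a plain number are all assumptions made for illustration; the summary does not give the exact form of the individual loss terms.

```python
import math


def kl_divergence(q, p, eps=1e-12):
    """Discrete D_KL(q || p) for two probability vectors of equal length."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same length")
    # Terms with q_i == 0 contribute nothing; p_i is clamped to avoid log(0).
    return sum(qi * math.log(qi / max(pi, eps)) for qi, pi in zip(q, p) if qi > 0)


def pac_bayes_gap(kl, n, delta):
    """Complexity term sqrt((KL + ln(2*sqrt(N)/delta)) / (2N)) from the bound."""
    if n <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))


def total_loss(l_div, l_sim, l_out, r_pac, alpha, beta, gamma, eta):
    """Weighted sum L_total = alpha*L_div + beta*L_sim + gamma*L_out + eta*R_PAC."""
    return alpha * l_div + beta * l_sim + gamma * l_out + eta * r_pac


if __name__ == "__main__":
    # Hypothetical learned attention distribution and custom prior over 4 tokens.
    learned = [0.70, 0.10, 0.10, 0.10]
    prior = [0.40, 0.20, 0.20, 0.20]

    kl = kl_divergence(learned, prior)
    gap = pac_bayes_gap(kl, n=1000, delta=0.05)
    print(f"KL(Q||P) = {kl:.4f}, PAC-Bayes gap = {gap:.4f}")

    # Placeholder loss values and weights, chosen only to exercise the formula.
    print(total_loss(l_div=-0.3, l_sim=0.2, l_out=0.1, r_pac=gap,
                     alpha=1.0, beta=1.0, gamma=1.0, eta=0.5))
```

The gap term shrinks as \(N\) grows and widens as the posterior moves away from the prior, which is why a smaller KL divergence to the custom prior tightens the generalization bound.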

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

Text-image Alignment for Diffusion-based Perception

Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models

From Text to Pose to Image: Improving Diffusion Model Control and Quality

Towards Better Text-to-Image Generation Alignment via Attention Modulation

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

ECNet: Effective Controllable Text-to-Image Diffusion Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Text-driven Visual Synthesis with Latent Diffusion Prior

Controlled and Conditional Text to Image Generation with Diffusion Prior

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis