Abstract: Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have shown that generated images sometimes fail to capture the intended semantic content of the text prompt, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Through extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses semantic misalignment, a common problem in text-to-image generation. Although current text-to-image diffusion models perform well in image generation tasks, the generated images sometimes fail to capture the semantic content expected from the text prompt. This phenomenon is called **semantic misalignment**. For example, some concepts may be ignored in the generated image (catastrophic neglect), or in multi-modal inpainting tasks the model may fail to fill the masked area according to the text prompt.

To address this challenge, the authors propose an energy-based cross-attention framework (Energy-Based Cross Attention, EBCA), which achieves adaptive context control by modeling the posterior distribution of context vectors. The specific methods are as follows:

1. **Energy-based Bayesian Context Update (EBCU)**:
   - In each cross-attention layer of the denoising autoencoder, build an energy-based model (EBM) of the latent image representation and the text embedding.
   - Establish the correspondence between the context and the representation by minimizing the parameterized energy function, thereby updating the context vector.
   - Implicitly minimize the nested hierarchy of energy functions by following the gradient of the log posterior of the context vectors.
2. **Energy-based Composition of Queries (EBCQ)**:
   - Use the EBM in the cross-attention space to achieve zero-shot compositional generation.
   - Compose multiple editing prompts by linearly combining the cross-attention outputs of different contexts.

### Main contributions

1. **Solve the semantic misalignment problem**: Improve the semantic consistency between the generated image and the text prompt through adaptive context control.
2. **Zero-shot compositional generation**: Use the intrinsic composability of EBMs to integrate multiple distributions without additional training.
3. **Wide applicability**: The method can be integrated into existing text-to-image diffusion models without additional training.

### Experimental verification

The authors evaluate the proposed method on multi-concept generation, text-guided image inpainting, and compositional generation tasks. The reported results show that the method handles these tasks well and improves the semantic alignment of the generated images.

### Formula summary

- **Energy function**:
  \[ E(Q; K)=\frac{\alpha}{2}\text{diag}(K K^T)-\sum_{i=1}^{N}\log\sum_{j=1}^{P^2}\exp(\beta q_j^T k_i) \]
  \[ E(K)=\log\sum_{i=1}^{N}\exp\left(\frac{1}{2}k_i k_i^T\right) \]
- **Posterior log-gradient** (each density is proportional to \(\exp(-E)\), so both energy gradients enter with a minus sign):
  \[ \nabla_K\log p(K|Q)=\nabla_K\log p(Q|K)+\nabla_K\log p(K)=-\nabla_K E(Q; K)-\nabla_K E(K) \]
- **Context update rule**:
  \[ C_{n+1}=C_n+\gamma\left(\text{softmax}_2(\beta K Q^T)Q W_K^T-\left(\alpha I + D\left(\text{softmax}\left(\frac{1}{2}\text{diag}(K K^T)\right)\right)\right)K W_K^T\right) \]
- **Combined energy function**:
  \[ \hat{E}(Q;\{K_s\}_{s = 1}^M)=\frac{1}{M}\
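
As a rough illustration of the context update rule in the formula summary, the following NumPy sketch performs one update step on random toy tensors. It is not the authors' implementation. The function names, the shapes (`C` is N×d_c, `W_K` is d_c×d, `Q` is P²×d, with `K = C @ W_K`), and the toy values are assumptions made for illustration. Two further assumptions make the energy a scalar whose negative gradient matches the update rule as written: the quadratic term is summed over tokens, and the log-sum-exp term is scaled by 1/β.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def logsumexp(x, axis):
    """Numerically stable log-sum-exp along the given axis."""
    m = x.max(axis=axis, keepdims=True)
    s = np.exp(x - m).sum(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(s), axis=axis)


def total_energy(K, Q, alpha, beta):
    """E(Q; K) + E(K) as a scalar (1/beta scaling is an assumption)."""
    sq = (K * K).sum(axis=1)  # diag(K K^T), one entry per context token
    likelihood = 0.5 * alpha * sq.sum() - logsumexp(beta * K @ Q.T, axis=1).sum() / beta
    prior = logsumexp(0.5 * sq, axis=0)
    return float(likelihood + prior)


def update_direction(K, Q, alpha, beta):
    """Negative gradient of total_energy with respect to K."""
    attn = softmax(beta * K @ Q.T, axis=1)                  # softmax over queries
    prior_w = softmax(0.5 * (K * K).sum(axis=1), axis=0)    # softmax over tokens
    return attn @ Q - (alpha * K + prior_w[:, None] * K)


def context_step(C, Q, W_K, alpha, beta, gamma):
    """One context update: C_{n+1} = C_n + gamma * direction @ W_K^T."""
    K = C @ W_K
    return C + gamma * update_direction(K, Q, alpha, beta) @ W_K.T


# Toy usage with hypothetical sizes: 4 context tokens, 9 spatial queries.
rng = np.random.default_rng(0)
C = 0.5 * rng.standard_normal((4, 6))
W_K = 0.5 * rng.standard_normal((6, 8))
Q = 0.5 * rng.standard_normal((9, 8))
C_next = context_step(C, Q, W_K, alpha=0.3, beta=1.5, gamma=1e-3)
```

In the method as described, the updated context would then be passed to the subsequent cross-attention layer; the sketch covers a single layer only.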

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Emage: Non-Autoregressive Text-to-Image Generation

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Text-to-Image Generation Via Energy-Based CLIP

Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Contextualized Diffusion Models for Text-Guided Image and Video Generation

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

ECNet: Effective Controllable Text-to-Image Diffusion Models

Training Energy-Based Models with Diffusion Contrastive Divergences

Eliminating Contextual Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion

Improving Adversarial Energy-Based Model via Diffusion Process

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Dense Text-to-Image Generation with Attention Modulation

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

Persistently Trained, Diffusion-assisted Energy-based Models

EGC: Image Generation and Classification via a Diffusion Energy-Based Model