Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Geon Yeong Park, Jeongsol Kim, Beomsu Kim, Sang Wan Lee, Jong Chul Ye
2023-11-05
Abstract: Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have raised the issue that generated images sometimes fail to capture the intended semantic contents of the text prompts, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
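The abstract describes updating the context vectors with the gradient of their log posterior inside a cross-attention layer. As a rough illustration, the following is a minimal pure-Python sketch of one such step, following the context update rule quoted in the formula summary of this page. It is not the authors' implementation: the helper names (`matmul`, `transpose`, `softmax`, `context_update`) and the default values of `alpha`, `beta`, and `gamma` are hypothetical, and the sketch assumes that the keys are `K = C W_K` and that `softmax_2` normalizes over image positions (the rows of `Q`).

```python
import math

def matmul(A, B):
    # Plain list-of-lists matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(v):
    # Numerically stable softmax over a flat list.
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def context_update(C, Q, W_K, alpha=0.0, beta=1.0, gamma=0.1):
    """One hypothetical context update step.

    C:   N x d_c context (text) vectors
    Q:   P^2 x d image queries
    W_K: d_c x d key projection (assumption: K = C W_K)
    """
    K = matmul(C, W_K)                                   # N x d
    scores = matmul(K, transpose(Q))                     # N x P^2
    # Assumption: softmax_2 normalizes each token's scores over image positions.
    attn = [softmax([beta * s for s in row]) for row in scores]
    pull = matmul(attn, Q)                               # N x d, attraction toward queries
    # Prior term: softmax over tokens of half the squared key norms.
    w = softmax([0.5 * sum(x * x for x in k) for k in K])
    grad_K = [[p - (alpha + w_i) * k for p, k in zip(p_row, k_row)]
              for p_row, k_row, w_i in zip(pull, K, w)]
    step = matmul(grad_K, transpose(W_K))                # back to context space, N x d_c
    return [[c + gamma * s for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(C, step)]

# Toy usage: two context tokens, three image positions, identity key projection.
C = [[0.0, 0.0], [1.0, 0.0]]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
C_next = context_update(C, Q, W_K)
```

In this toy setting the first token starts at the origin, so its regularization term vanishes and the step moves it toward the average of the queries. In a real diffusion model the updated context would be passed on to the next cross-attention layer rather than used in isolation.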
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses semantic misalignment in text-to-image generation. Although current text-to-image diffusion models perform well in image generation tasks, the generated images sometimes fail to capture the semantic content intended by the text prompt, a phenomenon called **semantic misalignment**. For example, some concepts in the prompt may be ignored in the generated image (catastrophic neglect), or, in multi-modal inpainting tasks, the model may fail to fill the masked area according to the text prompt.

To address this challenge, the authors propose an energy-based cross-attention framework (Energy-Based Cross Attention, EBCA), which achieves adaptive context control by modeling the posterior distribution of context vectors. It has two components:

1. **Energy-based Bayesian Context Update (EBCU)**:
   - In each cross-attention layer of the denoising autoencoder, build an energy-based model (EBM) of the latent image representations and text embeddings.
   - Update the context vector by minimizing the parameterized energy function, which establishes the correspondence between the context and the image representation.
   - Follow the gradient of the log posterior of the context vectors and pass the updated context to the subsequent cross-attention layer, which implicitly minimizes a nested hierarchy of energy functions.
2. **Energy-based Composition of Queries (EBCQ)**:
   - Use the EBM in the cross-attention space to achieve zero-shot compositional generation.
   - Compose multiple editing prompts by linearly combining the cross-attention outputs of different contexts.

### Main contributions

1. **Solve the semantic misalignment problem**: Improve the semantic consistency between the generated image and the text prompt through adaptive context control.
2. **Zero-shot compositional generation**: Use the intrinsic compositionality of EBMs to combine multiple distributions without additional training.
3. **Wide applicability**: The method can be integrated into existing text-to-image diffusion models without additional training.

### Experimental verification

The authors evaluate the proposed method on multi-concept generation, text-guided image inpainting, and compositional generation tasks. The results show that the method performs well across these tasks and improves the semantic alignment of the generated images.

### Formula summary

- **Energy function**:
  \[ E(Q; K)=\frac{\alpha}{2}\text{diag}(K K^T)-\sum_{i = 1}^{N}\log\sum_{j = 1}^{P^2}\exp(\beta q_j^T k_i) \]
  \[ E(K)=\log\sum_{i = 1}^{N}\exp\left(\frac{1}{2}k_i k_i^T\right) \]
- **Posterior log-gradient**:
  \[ \nabla_K\log p(K|Q)=\nabla_K\log p(Q|K)+\nabla_K\log p(K)=-\nabla_K E(Q; K)-\nabla_K E(K) \]
- **Context update rule**:
  \[ C_{n + 1}=C_n+\gamma\left(\text{softmax}_2(\beta K Q^T)Q W_K^T-\left(\alpha I + D\left(\text{softmax}\left(\frac{1}{2}\text{diag}(K K^T)\right)\right)\right)K W_K^T\right) \]
- **Combined energy function**: \[ \hat{E}(Q;\{K_s\}_{s = 1}^M)=\frac{1}{M}\