Abstract: We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. In contrast, self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method, temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when applied to various existing text-conditional diffusion models, TGATE accelerates them by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.

What problem does this paper attempt to address?

This paper aims to improve the inference efficiency of text-conditional diffusion models for image generation while maintaining the quality of the generated images. By analyzing the role of the attention mechanism across inference steps, the authors find that cross-attention and self-attention matter at different stages:

1. **Cross-Attention**:
   - In the semantic-planning phase, cross-attention is crucial for generating visual semantics aligned with the text.
   - In the fidelity-improving phase, the role of cross-attention weakens and becomes almost negligible.
2. **Self-Attention**:
   - In the semantic-planning phase, self-attention plays a relatively minor role.
   - In the fidelity-improving phase, self-attention becomes crucial.

Based on these observations, the authors propose a method called "Temporal Gating of Attention" (TGATE). TGATE accelerates inference by caching and reusing attention outputs at specific stages, which removes redundant computation with little effect on the quality of the generated images.

### Main Contributions

- **Improving Inference Efficiency**: Caching and reusing the outputs of cross-attention and self-attention removes unnecessary computation and increases inference speed.
- **No Need for Retraining**: TGATE requires no retraining and can be applied directly to existing diffusion models.
- **Wide Applicability**: The method applies to multiple architectures (such as U-Net and Transformer) as well as to different noise schedulers and acceleration methods.

### Experimental Results

Experiments show that TGATE significantly improves inference speed on multiple state-of-the-art diffusion models (such as SD-1.5, SD-2.1, SDXL, and PixArt-Alpha) while maintaining or slightly improving the quality of the generated images. For example, on the PixArt-Alpha model, TGATE reduces the inference time from 61.502 seconds to 32.827 seconds (roughly a 47% reduction), and the amount of computation is also greatly reduced.

### Conclusion

Through an analysis of the role of the attention mechanism at different inference stages, this research proposes TGATE, which improves the inference efficiency of text-conditional diffusion models and offers a new approach to faster image generation.
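The caching idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the `GatedAttention` wrapper, the `gate_step` parameter, the `toy_attention` stand-in, and the update rule inside `run_inference` are all hypothetical names and choices made for illustration. The sketch only shows the control flow of computing an attention output during the early steps and reusing the cached result afterwards.

```python
from typing import Callable, List, Optional

Vector = List[float]


class GatedAttention:
    """Hypothetical wrapper: compute attention before `gate_step`, reuse the cache after."""

    def __init__(self, attn_fn: Callable[[Vector, Vector], Vector], gate_step: int) -> None:
        self.attn_fn = attn_fn
        self.gate_step = gate_step
        self.cache: Optional[Vector] = None
        self.num_computed = 0  # how many times attention was actually evaluated

    def __call__(self, hidden: Vector, context: Vector, step: int) -> Vector:
        # Once the gate step is reached, skip the computation and reuse the cache.
        if step >= self.gate_step and self.cache is not None:
            return self.cache
        out = self.attn_fn(hidden, context)
        self.num_computed += 1
        self.cache = out  # remember the most recent output
        return out


def toy_attention(hidden: Vector, context: Vector) -> Vector:
    """Stand-in for a real attention layer (assumption): shift by the context mean."""
    mean_ctx = sum(context) / len(context)
    return [h + mean_ctx for h in hidden]


def run_inference(num_steps: int, gate_step: int) -> int:
    """Run a toy denoising loop and return how often attention was computed."""
    gated = GatedAttention(toy_attention, gate_step)
    hidden = [0.0, 1.0, 2.0]   # toy latent
    context = [0.5, 1.5]       # toy text embedding
    for step in range(num_steps):
        attn_out = gated(hidden, context, step)
        # Toy update rule (assumption): blend the latent with the attention output.
        hidden = [0.9 * h + 0.1 * a for h, a in zip(hidden, attn_out)]
    return gated.num_computed


if __name__ == "__main__":
    print(run_inference(num_steps=10, gate_step=4))  # prints 4
```

With 10 steps and a gate at step 4, the attention function runs only for steps 0 to 3 and the remaining six steps reuse the cached output. In a real model, the choice of gate step trades speed against text alignment, which is why the source describes the caching as happening at scheduled time steps.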

Faster Diffusion via Temporal Attention Decomposition

Gdformer: A Graph Diffusing Attention Based Approach for Traffic Flow Prediction

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Towards Better Text-to-Image Generation Alignment via Attention Modulation

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

AID: Attention Interpolation of Text-to-Image Diffusion

Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models

Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models

Scene Graph Conditioning in Latent Diffusion

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators