Faster Diffusion via Temporal Attention Decomposition

Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber
2024-07-18
Abstract: We explore the role of attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. However, self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show when widely applied to various existing text-conditional diffusion models, TGATE accelerates these models by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
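The caching idea in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `GatedAttention`, `run_schedule`, `gate_step`, and `sa_interval` are hypothetical names, the "attention" functions are trivial stand-ins operating on plain lists, and the exact schedule (cross-attention computed only before the gate, self-attention computed sparsely before the gate and at every step after it) is one assumed reading of the two-phase description.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


class GatedAttention:
    """Wraps an attention callable and reuses a cached output when told to."""

    def __init__(self, attn_fn: Callable[[Vector], Vector]) -> None:
        self.attn_fn = attn_fn
        self.cache: Optional[Vector] = None
        self.calls = 0  # how many times the real attention was evaluated

    def __call__(self, x: Vector, recompute: bool) -> Vector:
        # Recompute when asked to, or when nothing has been cached yet.
        if recompute or self.cache is None:
            self.cache = self.attn_fn(x)
            self.calls += 1
        return self.cache


def run_schedule(num_steps: int, gate_step: int, sa_interval: int) -> Tuple[int, int]:
    """Toy denoising loop; returns (cross_attn_calls, self_attn_calls)."""
    # Hypothetical stand-ins for real attention layers.
    cross_attn = GatedAttention(lambda x: [0.5 * v for v in x])
    self_attn = GatedAttention(lambda x: [v + 1.0 for v in x])

    latent = [1.0, 2.0, 3.0]
    for step in range(num_steps):
        planning = step < gate_step  # assumed semantics-planning phase
        # Cross-attention: evaluated during planning, reused from cache afterwards.
        ca = cross_attn(latent, recompute=planning)
        # Self-attention: sparse during planning, every step afterwards (assumption).
        sa = self_attn(latent, recompute=(not planning) or step % sa_interval == 0)
        # Toy update standing in for the denoiser's residual connections.
        latent = [0.9 * v + 0.05 * (c + s) for v, c, s in zip(latent, ca, sa)]
    return cross_attn.calls, self_attn.calls


if __name__ == "__main__":
    ca_calls, sa_calls = run_schedule(num_steps=25, gate_step=10, sa_interval=5)
    print(f"cross-attention evaluated {ca_calls}/25 times")
    print(f"self-attention evaluated {sa_calls}/25 times")
```

With these illustrative settings the loop evaluates cross-attention on 10 of 25 steps and self-attention on 17 of 25 steps. The remaining steps reuse cached outputs, which is where the savings in a real model would come from.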
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to speed up inference in text-conditional diffusion models while maintaining the quality of the generated images. By analyzing the role of the attention mechanism at different inference steps, the authors find that cross-attention and self-attention differ in importance from one stage to the next:

1. **Cross-Attention**:
   - In the semantics-planning phase, cross-attention is crucial for generating visual semantics aligned with the text.
   - In the fidelity-improving phase, the role of cross-attention weakens and becomes almost negligible.
2. **Self-Attention**:
   - In the semantics-planning phase, self-attention plays a relatively minor role.
   - In the fidelity-improving phase, self-attention becomes crucial.

Based on these observations, the authors propose a method called temporally gating the attention (TGATE). TGATE accelerates inference by caching and reusing attention outputs at scheduled time steps, which removes redundant computation while hardly affecting the quality of the generated images.

### Main Contributions

- **Improving Inference Efficiency**: Caching and reusing the outputs of cross-attention and self-attention removes unnecessary computation and increases inference speed.
- **No Need for Retraining**: TGATE is training-free and can be applied directly to existing diffusion models.
- **Wide Applicability**: The method applies to multiple architectures (such as U-Net and Transformer) as well as to different noise schedulers and acceleration methods.

### Experimental Results

Experiments show that TGATE improves inference speed on multiple state-of-the-art diffusion models (such as SD-1.5, SD-2.1, SDXL, and PixArt-Alpha) while maintaining or slightly improving the quality of the generated images. For example, on the PixArt-Alpha model, TGATE reduces inference time from 61.502 seconds to 32.827 seconds, and the amount of computation also drops substantially.

### Conclusion

By analyzing the role of the attention mechanism across inference stages, this work arrives at TGATE, which improves the inference efficiency of text-conditional diffusion models and offers a training-free route to faster image generation.
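To put the PixArt-Alpha timings quoted above in context, the short calculation below converts the two latencies into a relative time reduction and a speedup factor. The helper name `speedup_summary` is hypothetical. Only the two timings come from the text.

```python
from typing import Tuple


def speedup_summary(baseline_s: float, accelerated_s: float) -> Tuple[float, float]:
    """Return (relative time reduction, speedup factor) for two latencies."""
    reduction = 1.0 - accelerated_s / baseline_s
    factor = baseline_s / accelerated_s
    return reduction, factor


# PixArt-Alpha inference times quoted in the summary (seconds).
reduction, factor = speedup_summary(61.502, 32.827)
print(f"time reduction: {reduction:.1%}, speedup: {factor:.2f}x")
```

This works out to a time reduction of roughly 46.6% (about a 1.87x speedup), which sits near the upper end of the 10%-50% range stated in the abstract.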