Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Senmao Li, Taihang Hu, Joost van de Weijer, Fahad Shahbaz Khan, Tao Liu, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, Jian Yang
2024-10-15
Abstract: One of the main drawbacks of diffusion models is the slow inference time for image generation. Among the most successful approaches to addressing this problem are distillation methods. However, these methods require considerable computational resources. In this paper, we take another approach to diffusion model acceleration. We conduct a comprehensive study of the UNet encoder and empirically analyze the encoder features. This provides insights regarding their changes during the inference process. In particular, we find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps. This insight motivates us to omit encoder computation at certain adjacent time-steps and reuse encoder features of previous time-steps as input to the decoder in multiple time-steps. Importantly, this allows us to perform decoder computation in parallel, further accelerating the denoising process. Additionally, we introduce a prior noise injection method to improve the texture details in the generated image. Besides the standard text-to-image task, we also validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation. Without utilizing any knowledge distillation technique, our approach accelerates both the Stable Diffusion (SD) and DeepFloyd-IF model sampling by 41% and 24%, respectively, and DiT model sampling by 34%, while maintaining high-quality generation performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to address the issue of slow inference speed in image generation using diffusion models. Specifically, the authors found that existing acceleration methods such as knowledge distillation require a large amount of computational resources and may affect the quality and diversity of generated images in some cases. Therefore, this paper proposes a new acceleration method by rethinking the role of the encoder in diffusion model inference to achieve faster image generation speed.

### Main Contributions:

1. **Comprehensive Empirical Study**: The authors conducted a detailed analysis of the UNet encoder features in pre-trained diffusion models and found that the encoder features change very little between different time steps, while the decoder features change significantly.
2. **Parallel Strategy**: Based on the above findings, the authors proposed a method to omit the encoder computation in adjacent time steps and reuse the encoder features from the previous time step as input. This allows the decoder computation to be performed in parallel across multiple time steps, significantly speeding up the denoising process.
3. **Prior Noise Injection**: To mitigate the degradation in the quality of generated images, the authors introduced a prior noise injection strategy to preserve the texture details of the generated images.
4. **Combination with Existing Methods**: This method can be combined with other existing acceleration methods (such as DDIM and DPM-Solver) to further improve the inference speed of diffusion models.

### Experimental Results:

- **Standard Text-to-Image Generation Task**: On the Stable Diffusion and DeepFloyd-IF models, this method reduced the sampling time by 41% and 24%, respectively, while maintaining high-quality generation performance.
- **Other Tasks**: This method also performed well in tasks such as text-to-video generation, personalized generation, and reference-guided generation, significantly improving inference speed.

### Summary:

This paper proposes an efficient acceleration method by redesigning the computation flow of the encoder and decoder in diffusion models, which can significantly improve inference speed without sacrificing the quality of generated images. This method is not only applicable to standard text-to-image generation tasks but can also be extended to other conditional generation tasks.
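
The encoder-reuse and parallel-decoder idea described above can be illustrated with a small sketch. This is not the authors' implementation: `toy_encoder`, `toy_decoder`, the `key_timesteps` set, and the `x - 0.1 * eps` update are all hypothetical stand-ins chosen for illustration (a real pipeline would use a UNet and a proper sampler such as DDIM). It only shows the control flow: run the encoder at a few "key" time-steps, reuse its features for the following time-steps, and evaluate the decoder for those time-steps in one batch, which is possible because the decoder input here depends only on the cached features and the time-step.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
# Fixed random weights for the toy networks (assumption: stand-ins for a UNet).
W_ENC = 0.1 * rng.standard_normal((DIM, DIM))
W_DEC = 0.1 * rng.standard_normal((DIM, DIM))


def toy_encoder(x):
    """Hypothetical stand-in for the UNet encoder."""
    return np.tanh(x @ W_ENC)


def toy_decoder(feats, t):
    """Hypothetical stand-in for the UNet decoder, conditioned on time-step t."""
    return np.tanh(feats @ W_DEC + 0.01 * t)


def sample_baseline(x, timesteps):
    """Standard loop: the encoder runs at every time-step."""
    encoder_calls = 0
    for t in timesteps:
        feats = toy_encoder(x)
        encoder_calls += 1
        x = x - 0.1 * toy_decoder(feats, float(t))  # placeholder update rule
    return x, encoder_calls


def sample_with_encoder_reuse(x, timesteps, key_timesteps):
    """Run the encoder only at key time-steps and reuse its features after."""
    encoder_calls = 0
    i = 0
    while i < len(timesteps):
        # A segment is one key time-step plus the non-key steps that follow it.
        # The first time-step is always treated as a key time-step.
        j = i + 1
        while j < len(timesteps) and timesteps[j] not in key_timesteps:
            j += 1
        segment = np.asarray(timesteps[i:j], dtype=float)

        feats = toy_encoder(x)  # one encoder pass for the whole segment
        encoder_calls += 1

        # One batched decoder call covers every time-step in the segment.
        eps_batch = toy_decoder(feats[None, :], segment[:, None])

        # The sample updates themselves are still applied in order.
        for eps in eps_batch:
            x = x - 0.1 * eps
        i = j
    return x, encoder_calls


x0 = rng.standard_normal(DIM)
timesteps = list(range(10, 0, -1))
_, full_calls = sample_baseline(x0, timesteps)
_, reuse_calls = sample_with_encoder_reuse(x0, timesteps, key_timesteps={10, 7, 4})
print(full_calls, reuse_calls)  # 10 encoder passes vs. 3
```

In this toy setting the saving shows up only as a lower encoder call count. How much wall-clock time that saves in practice, and how the key time-steps should be chosen, depends on the model and sampler and is not something this sketch measures.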