Abstract: One of the main drawbacks of diffusion models is their slow inference time for image generation. Distillation methods are among the most successful approaches to this problem, but they require considerable computational resources. In this paper, we take a different approach to diffusion model acceleration. We conduct a comprehensive study of the UNet encoder and empirically analyze how its features change during inference. In particular, we find that encoder features change minimally across time-steps, whereas decoder features exhibit substantial variation. This insight motivates us to omit encoder computation at certain adjacent time-steps and reuse the encoder features of previous time-steps as input to the decoder at multiple time-steps. Importantly, this allows us to perform decoder computation in parallel, further accelerating the denoising process. Additionally, we introduce a prior noise injection method to improve the texture details of the generated image. Besides the standard text-to-image task, we also validate our approach on other tasks: text-to-video, personalized generation, and reference-guided generation. Without utilizing any knowledge distillation technique, our approach accelerates Stable Diffusion (SD) and DeepFloyd-IF model sampling by 41% and 24% respectively, and DiT model sampling by 34%, while maintaining high-quality generation performance.

What problem does this paper attempt to address?

This paper attempts to address the issue of slow inference speed in image generation using diffusion models. Specifically, the authors found that existing acceleration methods such as knowledge distillation require a large amount of computational resources and may affect the quality and diversity of generated images in some cases. Therefore, this paper proposes a new acceleration method by rethinking the role of the encoder in diffusion model inference to achieve faster image generation speed.

### Main Contributions:

1. **Comprehensive Empirical Study**: The authors conducted a detailed analysis of the UNet encoder features in pre-trained diffusion models and found that the encoder features change very little between different time steps, while the decoder features change significantly.
2. **Parallel Strategy**: Based on the above findings, the authors proposed a method to omit the encoder computation in adjacent time steps and reuse the encoder features from the previous time step as input. This allows the decoder computation to be performed in parallel across multiple time steps, significantly speeding up the denoising process.
3. **Prior Noise Injection**: To mitigate the degradation in the quality of generated images, the authors introduced a prior noise injection strategy to preserve the texture details of the generated images.
4. **Combination with Existing Methods**: This method can be combined with other existing acceleration methods (such as DDIM and DPM-Solver) to further improve the inference speed of diffusion models.

### Experimental Results:

- **Standard Text-to-Image Generation Task**: On the Stable Diffusion and DeepFloyd-IF models, this method reduced the sampling time by 41% and 24%, respectively, while maintaining high-quality generation performance.
- **Other Tasks**: This method also performed well in tasks such as text-to-video generation, personalized generation, and reference-guided generation, significantly improving inference speed.
### Summary:

This paper proposes an efficient acceleration method by redesigning the computation flow of the encoder and decoder in diffusion models, which can significantly improve inference speed without sacrificing the quality of generated images. This method is not only applicable to standard text-to-image generation tasks but can also be extended to other conditional generation tasks.
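
The encoder-reuse idea described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: `ToyUNet`, `sample`, the `key_steps` schedule, and the update rule are all hypothetical stand-ins chosen only to show where encoder calls are skipped. The encoder runs only at designated key time-steps, and its cached features feed the decoder at the steps in between.

```python
import math


class ToyUNet:
    """Hypothetical stand-in for a UNet split into encoder and decoder halves."""

    def __init__(self):
        self.encoder_calls = 0
        self.decoder_calls = 0

    def encode(self, z, t):
        # Toy encoder: in this sketch it depends only weakly on the time-step.
        self.encoder_calls += 1
        return [math.tanh(v + 0.01 * t) for v in z]

    def decode(self, feats, t):
        # Toy decoder: consumes encoder features plus the current time-step.
        self.decoder_calls += 1
        return [math.tanh(f) * (1.0 + 0.01 * t) for f in feats]


def sample(unet, z, timesteps, key_steps):
    """Denoising loop that recomputes encoder features only at key steps."""
    cached = None
    for t in timesteps:
        if cached is None or t in key_steps:
            cached = unet.encode(z, t)  # full encoder pass at a key step
        eps = unet.decode(cached, t)    # decoder runs at every step
        # Placeholder update (assumption): not DDIM, DPM-Solver, or any real sampler.
        z = [v - 0.1 * e for v, e in zip(z, eps)]
    return z


if __name__ == "__main__":
    unet = ToyUNet()
    steps = list(range(9, -1, -1))
    sample(unet, [0.5, -0.3, 0.1], steps, key_steps={9, 6, 3, 0})
    print(unet.encoder_calls, unet.decoder_calls)
```

With ten time-steps and four key steps, this toy loop calls the encoder four times and the decoder ten times. The loop here is sequential for clarity; because the decoder calls between two key steps share the same cached encoder features, they are the ones the summary describes as candidates for parallel execution.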

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Emage: Non-Autoregressive Text-to-Image Generation

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner

AdaDiff: Adaptive Step Selection for Fast Diffusion.

DeepCache: Accelerating Diffusion Models for Free

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration

AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation

Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

Accelerated Image-Aware Generative Diffusion Modeling

Efficiency-optimized Video Diffusion Models

StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

Relational Diffusion Distillation for Efficient Image Generation

SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models

Plug-and-Play Diffusion Distillation

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

Rapid Diffusion: Building Domain-Specific Text-to-Image Synthesizers with Fast Inference Speed

Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models