SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana
2024-11-16
Abstract: Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the high computational cost of inference in Diffusion Transformers (DiT) for tasks such as image, video, and speech generation. Inference repeatedly evaluates resource-intensive attention and feed-forward modules, which lowers computational efficiency and limits the adoption of these models in practical applications. To address this, the paper proposes SmoothCache, a general-purpose inference acceleration technique. SmoothCache exploits the high similarity of layer outputs between adjacent diffusion timesteps to adaptively cache and reuse key features, which reduces redundant computation and speeds up inference. Experimental results show a speedup of 8% to 71% while maintaining or even improving generation quality, and the technique applies to generation tasks across multiple modalities.

### Main contributions of the paper:

1. **Generality**: SmoothCache is a model-independent caching strategy that can be applied to any DiT architecture without model-specific assumptions or retraining.
2. **Adaptive caching**: By analyzing layer representation errors on a small calibration set, SmoothCache adaptively determines the caching strength for different denoising stages.
3. **Performance improvement**: Experimental results show that SmoothCache significantly accelerates inference in image generation, text-to-video, and text-to-audio tasks while maintaining or improving generation quality.
4. **Compatibility**: SmoothCache is compatible with various common solvers and can be combined with other optimization methods to further improve performance.
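
The adaptive caching contribution above can be illustrated with a minimal sketch. It assumes a relative L1 error as the representation error and a single threshold; the summary does not specify either, so the metric and the names `relative_l1_error`, `build_cache_schedule`, `threshold`, and `max_reuse` are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def relative_l1_error(current: Vector, reference: Vector) -> float:
    """Relative L1 distance between two layer outputs (assumed metric)."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference)
    return num / den if den > 0 else float("inf")


def build_cache_schedule(
    calib_outputs: List[List[Vector]],
    threshold: float,
    max_reuse: int = 3,
) -> List[bool]:
    """Derive a per-timestep schedule for one layer from calibration data.

    calib_outputs[s][t] is the layer's output for calibration sample s at
    timestep t. The returned schedule[t] is True when the layer should be
    recomputed at step t, and False when the output cached at the most
    recently computed step may be reused.
    """
    num_steps = len(calib_outputs[0])
    schedule = [True]  # the first step always computes the layer
    last_computed = 0
    for t in range(1, num_steps):
        # Average the error against the last computed step over all samples.
        errors = [
            relative_l1_error(sample[t], sample[last_computed])
            for sample in calib_outputs
        ]
        mean_error = sum(errors) / len(errors)
        # Reuse only if the error is small and the cache is not too stale.
        reuse = mean_error < threshold and (t - last_computed) <= max_reuse
        schedule.append(not reuse)
        if not reuse:
            last_computed = t
    return schedule
```

In this sketch a lower `threshold` yields fewer reused steps (less speedup, smaller deviation from the uncached model), while a higher one caches more aggressively. The `max_reuse` cap is an extra safeguard assumed here, not something the summary describes.
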
### Technical details:

- **Observation of high similarity**: The paper finds that layer outputs at adjacent timesteps in DiT models have high cosine similarity, which indicates computational redundancy in the diffusion process.
- **Adaptive caching strategy**: By analyzing layer representation errors on the calibration set, SmoothCache decides which layer outputs can be cached and reused.
- **Computational savings**: The caching strategy targets the computational bottlenecks in Transformers, such as self-attention and feed-forward layers, which usually consume a large share of computational resources during training and inference.

### Experimental verification:

- **Multi-modal tasks**: The paper runs experiments on image generation (DiT-XL), text-to-video (Open-Sora), and text-to-audio (Stable Audio Open).
- **Performance comparison**: Compared with existing methods such as FORA and L2C, SmoothCache achieves better generation quality at the same acceleration ratio, or faster inference at the same generation quality.

In conclusion, SmoothCache reduces the inference cost of DiT models across image, video, and audio generation, making these models more practical to deploy.
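
To make the caching mechanism described under Technical details more concrete, the following is a minimal sketch of how a precomputed schedule might be applied at inference time. It assumes the cached quantity is the layer's output before a residual connection, as in a typical Transformer block; the `CachedLayer` class, its `calls` counter, and the toy layer are hypothetical illustrations, not the authors' implementation.

```python
from typing import Callable, List, Optional

Vector = List[float]


class CachedLayer:
    """Wrap an expensive layer and reuse its output on scheduled steps."""

    def __init__(self, layer: Callable[[Vector], Vector], schedule: List[bool]):
        self.layer = layer
        self.schedule = schedule  # schedule[t] is True when step t recomputes
        self.cache: Optional[Vector] = None
        self.calls = 0  # number of real layer evaluations, for inspection

    def __call__(self, x: Vector, step: int) -> Vector:
        if self.schedule[step] or self.cache is None:
            # Recompute the expensive branch and refresh the cache.
            self.cache = self.layer(x)
            self.calls += 1
        # Residual connection (assumed): add the fresh or cached branch output.
        return [xi + ci for xi, ci in zip(x, self.cache)]


# Toy stand-in for an attention or feed-forward module.
def toy_layer(x: Vector) -> Vector:
    return [0.5 * v for v in x]


layer = CachedLayer(toy_layer, schedule=[True, False, True])
x = [2.0, 4.0]
for step in range(3):
    x = layer(x, step)
# Only two of the three steps evaluated toy_layer; step 1 reused the cache.
```

Because the wrapper only needs a callable and a schedule, the same pattern could in principle wrap attention and feed-forward modules separately, each with its own schedule.
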