Abstract: Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.

What problem does this paper attempt to address?

The paper addresses the high computational cost of inference in Diffusion Transformers (DiT) for image, video, and speech generation. Each denoising step re-evaluates resource-intensive attention and feed-forward modules, so inference is slow, which limits the adoption of these models in practical applications. To address this, the paper proposes SmoothCache, a general-purpose inference acceleration technique. SmoothCache exploits the high similarity of layer outputs between adjacent diffusion timesteps: it adaptively caches and reuses key features, which removes redundant computation and increases inference speed. The reported experiments show speedups of 8% to 71% while maintaining or even improving generation quality, across several generation modalities.

### Main contributions of the paper:

1. **Generality**: SmoothCache is a model-independent caching strategy that can be applied to any DiT architecture without model-specific assumptions or retraining.
2. **Adaptive caching**: By analyzing layer-wise representation errors on a small calibration set, SmoothCache adaptively determines the caching strength for different denoising stages.
3. **Performance improvement**: Experimental results show that SmoothCache accelerates inference in image generation, text-to-video, and text-to-audio tasks while maintaining or improving generation quality.
4. **Compatibility**: SmoothCache is compatible with common existing solvers and can be combined with other optimization methods to further improve performance.

### Technical details:

- **Observation of high similarity**: The paper finds that layer outputs at adjacent timesteps in a DiT model have high cosine similarity, which indicates computational redundancy in the diffusion process.
- **Adaptive caching strategy**: By analyzing layer-wise representation errors on the calibration set, SmoothCache decides which layer outputs can be cached and reused.
- **Computational savings**: The caching strategy targets the computational bottlenecks of Transformers, namely the self-attention and feed-forward layers, which usually consume a large share of computational resources during model training and inference.

### Experimental verification:

- **Multi-modal tasks**: The paper conducts experiments on image generation (DiT-XL), text-to-video (Open-Sora), and text-to-audio (Stable Audio Open).
- **Performance comparison**: Compared with existing methods such as FORA and L2C, SmoothCache shows better generation quality at the same acceleration ratio, or faster inference at the same generation quality.

In conclusion, by proposing SmoothCache, the paper addresses the low inference efficiency of DiT models across multiple generation modalities and makes these models more practical to deploy.
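The adaptive caching idea summarized under the technical details can be illustrated with a small sketch. This is not the paper's implementation: the error metric (relative L1), the single `threshold` parameter, and the helper names `relative_l1_error`, `build_cache_schedule`, and `run_with_cache` are all assumptions made for illustration. It shows how per-timestep layer outputs from a calibration run could be turned into a recompute-or-reuse schedule, and how that schedule reduces the number of layer evaluations.

```python
import numpy as np


def relative_l1_error(current, reference):
    """Relative L1 difference between two activation arrays (assumed metric)."""
    return float(np.abs(current - reference).sum() / (np.abs(reference).sum() + 1e-12))


def build_cache_schedule(calib_outputs, threshold):
    """Turn calibration outputs of one layer into a caching schedule.

    calib_outputs: array of shape (num_steps, ...) holding the layer's output
        at each timestep (e.g. averaged over a small calibration set).
    threshold: hypothetical tolerance; a step is recomputed only when its
        output drifts more than this from the last recomputed step.
    Returns a list of bools: True = recompute, False = reuse cached output.
    """
    schedule = [True]  # the first step has nothing cached, so always compute
    last_computed = 0
    for t in range(1, calib_outputs.shape[0]):
        err = relative_l1_error(calib_outputs[t], calib_outputs[last_computed])
        if err > threshold:
            schedule.append(True)
            last_computed = t
        else:
            schedule.append(False)
    return schedule


def run_with_cache(layer_fn, inputs, schedule):
    """Apply layer_fn along the schedule, reusing the cached output when allowed."""
    cache = None
    outputs = []
    calls = 0
    for x, recompute in zip(inputs, schedule):
        if recompute or cache is None:
            cache = layer_fn(x)
            calls += 1
        outputs.append(cache)
    return outputs, calls


# Synthetic calibration data: outputs drift slowly, then jump at step 3.
calib = np.array([[1.00, 1.00],
                  [1.01, 1.01],
                  [1.02, 1.02],
                  [1.50, 1.50],
                  [1.51, 1.51]])
schedule = build_cache_schedule(calib, threshold=0.05)
outputs, calls = run_with_cache(lambda x: x * 2.0, list(calib), schedule)
print(schedule, calls)
```

On this toy data only the first step and the step where the output jumps are recomputed, so the layer runs twice instead of five times. A larger `threshold` trades more reuse for larger deviation from the uncached outputs.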

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Accelerating Vision Diffusion Transformers with Skip Branches

Accelerating Diffusion Transformers with Token-wise Feature Caching

$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Token Caching for Diffusion Transformer Acceleration

DeepCache: Accelerating Diffusion Models for Free

DiTFastAttn: Attention Compression for Diffusion Transformer Models

FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Dynamic Diffusion Transformer

FlexDiT: Dynamic Token Density Control for Diffusion Transformer

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

DiffiT: Diffusion Vision Transformers for Image Generation

Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation