Abstract: Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10$\times$ more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and PixArt-$\alpha$ with almost no drop in generation quality.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the high computational cost of Diffusion Transformers in image and video generation. Although Diffusion Transformers perform strongly on generation tasks, their computational cost makes inference slow, which limits their use in real-time scenarios. To tackle this, the paper introduces **Token-wise Feature Caching (ToCa)**, which accelerates inference by adaptively selecting the most suitable tokens to cache at different time steps and in different neural layers.

### Main Contributions

1. **Proposing Token-wise Caching (ToCa)**: A fine-grained feature caching strategy designed specifically for accelerating Diffusion Transformers. To the best of the authors' knowledge, ToCa is the first to bring the perspective of error propagation into feature caching methods.
2. **Defining Four Scoring Criteria**: These criteria select the most suitable tokens for caching in each layer without additional computational cost. ToCa can also apply different caching ratios to layers of different depths and types.
3. **Extensive Experimental Validation**: Experiments on PixArt-$\alpha$, OpenSora, and DiT show that ToCa achieves a high acceleration ratio while keeping generation quality almost lossless. For example, it achieves a 2.36$\times$ acceleration on OpenSora without requiring training.

### Method Overview

1. **Cache Initialization**: As in previous caching methods, all tokens are computed at the first time step, and the intermediate features of each self-attention, cross-attention, and MLP layer are stored in the cache.
2. **Using Cache for Computation**: In subsequent time steps, the computation of some unimportant tokens is skipped by reusing their values from the cache. A caching ratio $R$ determines what fraction of tokens is served from the cache and what fraction is computed.
3. **Cache Update**: Unlike traditional caching methods, ToCa can update the features in the cache at all time steps, which reduces the error introduced by feature reuse.
4. **Token Selection**: A caching scoring function $S(x_i)$ is defined that considers four factors:
   - **Impact on Other Tokens**: If a token contributes significantly to the values of other tokens, its caching error is likely to propagate to them.
   - **Impact on Control Ability**: In text-to-image generation, the cross-attention layer reflects the influence of the control signal (e.g., text) on each image token. Tokens strongly influenced by the control signal are not suitable for caching.
   - **Caching Frequency**: Tokens that were recently cached are not suitable for caching again in subsequent layers and time steps, because their caching errors accumulate quickly.
   - **Uniform Spatial Distribution**: The error introduced by caching should not concentrate in the same spatial region.

### Experimental Results

- On PixArt-$\alpha$, OpenSora, and DiT, ToCa achieves significant acceleration while keeping generation quality almost lossless.
- On OpenSora, ToCa achieves a 2.36$\times$ acceleration without requiring training, outperforming the baseline of directly halving the number of time steps.
- On PartiPrompts, ToCa even improves the CLIP score by 1.13x, indicating higher consistency between the generated results and the text conditions.

### Conclusion

By introducing Token-wise Feature Caching (ToCa), the paper reduces the computational cost of Diffusion Transformers in image and video generation, achieving substantial acceleration while maintaining generation quality.
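
The cache-reuse and token-selection steps in the method overview can be illustrated with a small sketch. This is not the authors' implementation: the function names (`toy_score`, `select_cached_tokens`, `cached_layer_step`), the additive form of the score, the weight `w_freq`, and the convention that a lower score means "safer to cache" are all assumptions made for illustration. Tokens are plain floats and the "layer" is any per-token function. The spatial-distribution factor is omitted.

```python
from typing import Callable, List


def toy_score(influence: List[float], control: List[float],
              times_cached: List[int], w_freq: float = 0.5) -> List[float]:
    """Hypothetical score combining three of the four factors.

    influence:    how much each token contributes to other tokens
    control:      how strongly the control signal (e.g. text) affects the token
    times_cached: how many consecutive times the token was served from cache
    A higher score means the token should be recomputed rather than cached.
    """
    return [a + c + w_freq * k
            for a, c, k in zip(influence, control, times_cached)]


def select_cached_tokens(scores: List[float], ratio: float) -> List[int]:
    """Return the indices of the lowest-scoring `ratio` fraction of tokens."""
    n_cached = int(len(scores) * ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_cached])


def cached_layer_step(tokens: List[float],
                      layer: Callable[[float], float],
                      cache: List[float],
                      times_cached: List[int],
                      cached_idx: List[int]) -> List[float]:
    """Run `layer` only on non-cached tokens; reuse cached outputs otherwise.

    Freshly computed outputs overwrite the cache (the cache-update step),
    and the per-token reuse counter is reset or incremented in place.
    """
    cached = set(cached_idx)
    out = []
    for i, x in enumerate(tokens):
        if i in cached:
            out.append(cache[i])      # skip computation, reuse old feature
            times_cached[i] += 1
        else:
            y = layer(x)              # compute and refresh the cache entry
            cache[i] = y
            times_cached[i] = 0
            out.append(y)
    return out


if __name__ == "__main__":
    layer = lambda x: 2.0 * x + 1.0
    # First time step: every token is computed and stored in the cache.
    cache = [layer(x) for x in [0.0, 1.0, 2.0, 3.0]]
    times_cached = [0, 0, 0, 1]
    # Later time step: score tokens, cache half of them (R = 0.5).
    scores = toy_score([0.9, 0.1, 0.5, 0.2], [0.3, 0.1, 0.8, 0.0], times_cached)
    idx = select_cached_tokens(scores, ratio=0.5)
    print(idx, cached_layer_step([10.0, 11.0, 12.0, 13.0], layer,
                                 cache, times_cached, idx))
```

In this sketch the reuse counter feeds back into the score, so a token that has been served from the cache several times in a row becomes progressively less likely to be cached again, mirroring the caching-frequency criterion described above.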

Accelerating Diffusion Transformers with Token-wise Feature Caching

Accelerating Diffusion Transformers with Dual Feature Caching

Token Caching for Diffusion Transformer Acceleration

Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free

HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Accelerating Vision Diffusion Transformers with Skip Branches

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

FlexDiT: Dynamic Token Density Control for Diffusion Transformer

Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model

Importance-based Token Merging for Diffusion Models

Token Merging for Fast Stable Diffusion

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

DeepCache: Accelerating Diffusion Models for Free