Abstract: Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10$\times$ more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and PixArt-$\alpha$ with almost no drop in generation quality.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
The paper addresses the high computational cost of Diffusion Transformers in image and video generation. Although Diffusion Transformers perform strongly on generation tasks, their computational cost makes inference slow, which limits their use in real-time scenarios. To tackle this, the paper introduces **Token-wise Feature Caching (ToCa)**, which accelerates inference by adaptively selecting the most suitable tokens to cache at different time steps and in different neural layers.
### Main Contributions
1. **Proposing Token-wise Feature Caching (ToCa)**: This is a fine-grained feature caching strategy designed specifically for accelerating Diffusion Transformers. To the best of the authors' knowledge, ToCa is the first feature caching method to take error propagation into account.
2. **Defining Four Scoring Criteria**: These criteria are used to select the most suitable tokens for caching in each layer without additional computational cost. ToCa can also apply different caching ratios to layers of different depths and types, which yields a family of feature caching configurations.
3. **Extensive Experimental Validation**: A large number of experiments were conducted on models such as PixArt-α, OpenSora, and DiT, demonstrating that ToCa achieves a high acceleration ratio while maintaining almost lossless generation quality. For example, it achieved a 2.36x acceleration on OpenSora without requiring training.
### Method Overview
1. **Cache Initialization**: Similar to previous caching methods, all tokens are computed at the first time step, and the intermediate features of each self-attention, cross-attention, and MLP layer are stored in the cache.
2. **Using Cache for Computation**: In subsequent time steps, computation is skipped for less important tokens and their cached values are reused instead. A caching ratio R sets the fraction of tokens that are cached; the remaining tokens are computed as usual.
3. **Cache Update**: Unlike traditional caching methods, ToCa can update the features in the cache at all time steps, thereby reducing the error introduced by feature reuse.
4. **Token Selection**: A caching scoring function $S(x_i)$ is defined for each token $x_i$, considering four factors:
- **Impact on Other Tokens**: If a token significantly contributes to the values of other tokens, its caching error is likely to propagate to other tokens.
- **Impact on Control Ability**: In text-to-image generation, the cross-attention layer reflects the influence of the control signal (e.g., text) on each image token. Tokens that are strongly influenced by the control signal are not suitable for caching.
- **Caching Frequency**: Tokens recently cached are not suitable for caching again in subsequent layers and time steps, as their caching errors will quickly accumulate.
- **Uniform Spatial Distribution**: Ensuring that the error introduced by caching does not concentrate in the same spatial region.
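
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weights `w_influence`, `w_control`, and `w_frequency`, and the weighted-sum form of the score are all hypothetical; the layer is treated as a per-token function (reasonable for an MLP, whereas an attention layer would also need the other tokens' keys and values); and the spatial-uniformity criterion is left out for brevity.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def caching_scores(
    influence: Sequence[float],      # how much each token contributes to other tokens
    control: Sequence[float],        # how strongly the control signal affects each token
    times_cached: Sequence[int],     # how often each token was cached recently
    w_influence: float = 1.0,        # hypothetical weights, not taken from the paper
    w_control: float = 1.0,
    w_frequency: float = 1.0,
) -> List[float]:
    """Return one score per token; a higher score means less suitable for caching."""
    return [
        w_influence * a + w_control * c + w_frequency * n
        for a, c, n in zip(influence, control, times_cached)
    ]


def select_cached(scores: Sequence[float], ratio: float) -> List[int]:
    """Return the indices of the `ratio` fraction of tokens with the lowest scores."""
    n_cached = int(len(scores) * ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_cached])


def cached_layer_step(
    layer: Callable[[Vector], Vector],
    tokens: Sequence[Vector],
    cache: List[Vector],
    cached_idx: Sequence[int],
) -> List[Vector]:
    """Reuse the cache for `cached_idx`, compute the other tokens, refresh their cache."""
    reuse = set(cached_idx)
    out = []
    for i, x in enumerate(tokens):
        if i in reuse:
            out.append(cache[i])     # skip computation and reuse the stored feature
        else:
            y = layer(x)             # compute the token and update its cache entry
            cache[i] = y
            out.append(y)
    return out


if __name__ == "__main__":
    double = lambda x: [2.0 * v for v in x]       # stand-in for a per-token layer
    first_step = [[1.0], [2.0], [3.0], [4.0]]
    cache = [double(x) for x in first_step]       # cache initialization: compute all tokens

    scores = caching_scores([0.9, 0.1, 0.5, 0.2], [0.8, 0.0, 0.1, 0.3], [0, 0, 0, 0])
    idx = select_cached(scores, ratio=0.5)
    print(idx)                                    # [1, 3]

    next_step = [[10.0], [20.0], [30.0], [40.0]]
    print(cached_layer_step(double, next_step, cache, idx))
    # [[20.0], [4.0], [60.0], [8.0]]
```

In the demo, tokens 1 and 3 have the lowest scores, so their outputs come from the cache filled at the first step, while tokens 0 and 2 are recomputed and their cache entries are refreshed.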
### Experimental Results
- On models such as PixArt-α, OpenSora, and DiT, ToCa achieved significant acceleration while maintaining almost lossless generation quality.
- For example, on OpenSora, ToCa achieved a 2.36x acceleration without requiring training, outperforming the method of directly halving the number of time steps.
- On PartiPrompts, ToCa even improved the CLIP score by 1.13x, indicating higher consistency between the generated results and the text conditions.
### Conclusion
By introducing Token-wise Feature Caching (ToCa), the paper successfully addresses the high computational cost of Diffusion Transformers in image and video generation, achieving efficient acceleration while maintaining generation quality.