Adaptive Caching for Faster Video Generation with Diffusion Transformers

Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Michael S. Ryoo, Tian Xie
2024-11-05
Abstract: Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the high computational cost and slow inference speed of video generation. Although recent Diffusion Transformers (DiTs) have made significant progress in generating temporally consistent, high-quality videos, they rely on larger models and heavier attention mechanisms, which raises computational requirements and slows inference. These challenges become more pronounced when generating longer videos.

To address them, the authors introduce a training-free method named **Adaptive Caching (AdaCache)**, which aims to accelerate video DiTs. The core ideas of AdaCache are:

1. **Content-dependent caching strategy**: Not all videos are the same. Some require fewer denoising steps to reach a reasonable quality, while others require more. AdaCache caches computations across the diffusion process and tailors the caching schedule to each video being generated, aiming for the best trade-off between quality and latency.
2. **Motion Regularization (MoReg)**: Motion information in the video is used to control how compute is allocated. Specifically, if the generated video contains a large amount of motion, caching is reduced (i.e., recomputation happens more often) to preserve generation quality.

With these methods, AdaCache significantly improves inference speed (for example, up to a 4.7x speedup on Open-Sora 720p - 2s video generation) without sacrificing generation quality. The contributions are plug-and-play and apply across multiple video DiT baselines.
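
The two ideas above can be illustrated with a small Python sketch. It is not the authors' implementation: the `CODEBOOK` thresholds, the mean-absolute-difference distance, the `(1 + motion)` scaling, and the names `steps_to_skip` and `run_with_cache` are all assumptions made for illustration, and the `compute` callback stands in for an expensive DiT block. The sketch only shows the general pattern: when features change little between two computed steps, the cached result is reused for more steps, and a larger motion score shortens that reuse window.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical codebook: (distance upper bound, steps to reuse the cache).
# A small change between computed steps allows a longer reuse window.
CODEBOOK: Sequence[Tuple[float, int]] = ((0.05, 4), (0.10, 3), (0.20, 2))


def steps_to_skip(distance: float, motion: float = 0.0) -> int:
    """Map a feature-change distance to the number of steps until recompute.

    `motion` (>= 0) inflates the distance so that high-motion content is
    recomputed more often. This scaling is an illustrative assumption.
    """
    scaled = distance * (1.0 + motion)
    for upper_bound, skip in CODEBOOK:
        if scaled < upper_bound:
            return skip
    return 1  # large change: recompute at the very next step


def run_with_cache(
    compute: Callable[[int], List[float]],
    num_steps: int,
    motion: float = 0.0,
) -> Tuple[List[List[float]], List[int]]:
    """Run `num_steps` steps, calling `compute` only when the schedule says so.

    Returns the per-step outputs and the list of steps actually computed.
    """
    outputs: List[List[float]] = []
    computed_steps: List[int] = []
    cached: List[float] = []
    next_compute = 0
    for step in range(num_steps):
        if step == next_compute:
            fresh = compute(step)
            if cached:
                # Mean absolute difference between successive computed features.
                dist = sum(abs(a - b) for a, b in zip(fresh, cached)) / len(fresh)
                next_compute = step + steps_to_skip(dist, motion)
            else:
                # No reference features yet: compute again at the next step.
                next_compute = step + 1
            cached = fresh
            computed_steps.append(step)
        outputs.append(cached)  # reuse the cached features on skipped steps
    return outputs, computed_steps


if __name__ == "__main__":
    # Features that never change: most steps reuse the cache.
    _, static_steps = run_with_cache(lambda s: [1.0, 1.0], num_steps=10)
    # Features that change a lot at every step: nothing is skipped.
    _, dynamic_steps = run_with_cache(lambda s: [float(s)], num_steps=10)
    print("static :", static_steps)
    print("dynamic:", dynamic_steps)
```

In this toy setup the schedule is decided at run time from the observed feature change rather than fixed in advance, which is the sense in which the caching is "adaptive". The real method operates on DiT features during denoising; the details of its distance metric and schedule are not given in this summary.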