Adaptive Caching for Faster Video Generation with Diffusion Transformers

Kumara Kahatapitiya,Haozhe Liu,Sen He,Ding Liu,Menglin Jia,Michael S. Ryoo,Tian Xie

2024-11-05

Abstract: Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the high computational cost and slow inference of video generation. Although recent Diffusion Transformers (DiTs) have made significant progress in generating high-quality, temporally consistent videos, they rely on larger models and heavier attention mechanisms, which raises compute requirements and slows inference. These challenges become more pronounced for longer videos.

To address them, the authors introduce a training-free method named **Adaptive Caching (AdaCache)** for accelerating video DiTs. Its core ideas are:

1. **Content-dependent caching strategy**: Not all videos are the same. Some reach a reasonable quality with fewer denoising steps, while others need more. AdaCache caches computations during the diffusion process and adjusts the caching schedule to the video being generated, optimizing the quality-latency trade-off.
2. **Motion Regularization (MoReg)**: Motion information in the video is used to control compute allocation. If the generated video contains a large amount of motion, caching is reduced (i.e., computations are recomputed more often) to preserve generation quality.

With these components, AdaCache significantly improves inference speed (for example, up to a 4.7x speedup on Open-Sora 720p - 2s video generation) without sacrificing generation quality. The contributions are plug-and-play and apply to multiple video DiT baselines.
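The two ideas above can be illustrated with a toy sketch. It is not the authors' implementation: the change metric (mean absolute difference between consecutive residuals), the threshold table in `schedule`, the `motion_weight` factor, and all function names are assumptions chosen for illustration, and the "latent" is a plain list of per-frame float lists rather than a real DiT tensor.

```python
from typing import Callable, List, Sequence, Tuple

Frames = List[List[float]]  # toy "latent": one flat list of floats per frame


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute element-wise difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def flatten(frames: Frames) -> List[float]:
    return [v for frame in frames for v in frame]


def motion_score(frames: Frames) -> float:
    """Assumed motion proxy: mean difference between neighbouring frames."""
    if len(frames) < 2:
        return 0.0
    diffs = [mean_abs_diff(f0, f1) for f0, f1 in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs)


def steps_to_reuse(
    change: float,
    motion: float,
    schedule: Sequence[Tuple[float, int]] = ((0.05, 4), (0.10, 2), (0.20, 1)),
    motion_weight: float = 1.0,
) -> int:
    """Map the observed change to how many upcoming steps reuse the cache.

    More motion inflates the change, so high-motion content is cached less
    (the role MoReg plays in the summary above). Thresholds are hypothetical.
    """
    adjusted = change * (1.0 + motion_weight * motion)
    for limit, reuse in schedule:
        if adjusted < limit:
            return reuse
    return 0  # large change: recompute at every step


def denoise(
    block: Callable[[Frames, int], Frames], latent: Frames, num_steps: int
) -> Tuple[Frames, List[int]]:
    """Run a toy denoising loop, recomputing `block` only when needed."""
    cached = None
    reuse_left = 0
    recomputed: List[int] = []
    for step in range(num_steps):
        if cached is not None and reuse_left > 0:
            residual = cached  # skip the expensive block, reuse cached residual
            reuse_left -= 1
        else:
            residual = block(latent, step)
            recomputed.append(step)
            if cached is not None:
                change = mean_abs_diff(flatten(residual), flatten(cached))
                reuse_left = steps_to_reuse(change, motion_score(residual))
            cached = residual
        # Residual update of the latent, frame by frame.
        latent = [[x + r for x, r in zip(fx, fr)] for fx, fr in zip(latent, residual)]
    return latent, recomputed
```

In this sketch, the second return value of `denoise` lists the steps at which the block was actually evaluated. A block whose output barely changes between steps is skipped for several steps at a time, while one whose output changes strongly (or shows large frame-to-frame differences) is recomputed at every step.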

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Accelerating Vision Diffusion Transformers with Skip Branches

HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Accelerating Diffusion Transformers with Token-wise Feature Caching

FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Efficiency-optimized Video Diffusion Models

SimDA: Simple Diffusion Adapter for Efficient Video Generation

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis

DeepCache: Accelerating Diffusion Models for Free

Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

AdaDiff: Adaptive Step Selection for Fast Diffusion

Reactive Video Caching via long-short-term fusion approach