Abstract: In this paper, we present \textbf{\textit{FasterCache}}, a novel training-free strategy designed to accelerate the inference of video diffusion models with high-quality generation. By analyzing existing cache-based methods, we observe that \textit{directly reusing adjacent-step features degrades video quality due to the loss of subtle variations}. We further perform a pioneering investigation of the acceleration potential of classifier-free guidance (CFG) and reveal significant redundancy between conditional and unconditional features within the same timestep. Capitalizing on these observations, we introduce FasterCache to substantially accelerate diffusion-based video generation. Our key contributions include a dynamic feature reuse strategy that preserves both feature distinction and temporal continuity, and CFG-Cache which optimizes the reuse of conditional and unconditional outputs to further enhance inference speed without compromising video quality. We empirically evaluate FasterCache on recent video diffusion models. Experimental results show that FasterCache can significantly accelerate video generation (e.g., 1.67$\times$ speedup on Vchitect-2.0) while keeping video quality comparable to the baseline, and consistently outperform existing methods in both inference speed and video quality.

What problem does this paper attempt to address?

The main problem this paper attempts to address is improving the inference speed of video diffusion models while maintaining the quality of the generated videos. Although existing cache-based acceleration methods can effectively reduce computational cost, directly reusing features from adjacent steps degrades video quality, especially in terms of detail preservation. In addition, classifier-free guidance (CFG) significantly enhances the quality of synthesized images and videos but adds extra computation, which lengthens inference time.

To tackle these issues, the authors propose FasterCache, a training-free strategy that accelerates the inference of video diffusion models through dynamic feature reuse and CFG-Cache, achieving efficient video generation without sacrificing video quality. The key contributions of FasterCache include:

1. **Dynamic Feature Reuse Strategy**: This strategy dynamically adjusts the features reused across time steps, so that the feature differences and temporal continuity between adjacent time steps are maintained. This allows for accelerated inference while preserving subtle changes and details in the generated videos.
2. **CFG-Cache**: This technique stores the residuals between conditional and unconditional outputs and dynamically enhances the high- and low-frequency components of these residuals before reuse, further speeding up inference while maintaining video details.

Experimental results show that FasterCache achieves significant acceleration across multiple video diffusion models while maintaining or even improving video quality compared to baseline models. For example, on the Vchitect-2.0 model, FasterCache achieved a 1.67x speedup with performance comparable to the baseline model (VBench: Baseline 80.80% → FasterCache 80.84%).
In summary, the paper analyzes the limitations of existing cache-based acceleration methods and the redundancy in CFG, and proposes an acceleration strategy that improves inference efficiency while preserving the quality of the generated videos.
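
The two ideas described above can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the 1-D signals, the frequency `cutoff`, and the weights `weight`, `w_low`, and `w_high` are all hypothetical, and the exact formulas used in the paper are not given in this summary. The first function estimates a skipped step's feature from two cached features instead of copying the latest one, so some step-to-step variation is kept. The remaining functions store the conditional/unconditional residual split into low- and high-frequency parts, then rebuild an approximate unconditional output from a conditional one.

```python
import numpy as np


def dynamic_feature_reuse(cached_prev, cached_prev2, weight):
    """Estimate a skipped step's feature from two cached features.

    weight = 0 reduces to plain reuse of the most recent cached feature;
    weight > 0 adds a scaled copy of the last observed change (assumed form).
    """
    return cached_prev + weight * (cached_prev - cached_prev2)


def split_frequencies(signal, cutoff):
    """Split a signal's spectrum into low- and high-frequency parts (last axis)."""
    spectrum = np.fft.fft(signal, axis=-1)
    freqs = np.abs(np.fft.fftfreq(signal.shape[-1]))
    low_mask = freqs <= cutoff
    low = np.where(low_mask, spectrum, 0.0)
    high = np.where(low_mask, 0.0, spectrum)
    return low, high


def cfg_cache_store(cond, uncond, cutoff=0.1):
    """Cache the uncond-minus-cond residual, separately per frequency band."""
    c_low, c_high = split_frequencies(cond, cutoff)
    u_low, u_high = split_frequencies(uncond, cutoff)
    return u_low - c_low, u_high - c_high


def cfg_cache_reuse(cond, residual, w_low=1.0, w_high=1.0, cutoff=0.1):
    """Approximate the unconditional output from cond plus the cached residual.

    w_low and w_high are hypothetical band weights; values above 1 would
    emphasize the corresponding band of the residual.
    """
    d_low, d_high = residual
    c_low, c_high = split_frequencies(cond, cutoff)
    spectrum = (c_low + w_low * d_low) + (c_high + w_high * d_high)
    return np.fft.ifft(spectrum, axis=-1).real


# Toy usage: random vectors stand in for model outputs at one timestep.
rng = np.random.default_rng(0)
cond = rng.standard_normal(64)
uncond = cond + 0.1 * rng.standard_normal(64)
residual = cfg_cache_store(cond, uncond)
approx_uncond = cfg_cache_reuse(cond, residual)
```

With both band weights set to 1 and the same conditional output, the reconstruction matches the stored unconditional output up to floating-point error. In an actual sampler the residual would be reused at later timesteps with a different conditional output, so the result would only be an approximation, which is where the per-band weights would come into play.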

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Adaptive Caching for Faster Video Generation with Diffusion Transformers

DeepCache: Accelerating Diffusion Models for Free

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Accelerating Diffusion Transformers with Token-wise Feature Caching

HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

Efficiency-optimized Video Diffusion Models

Accelerating Vision Diffusion Transformers with Skip Branches

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Reactive Video Caching via long-short-term fusion approach

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

A Long-Short-Term Fusion Approach for Video Cache.

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Frame-Level Video Caching and Transmission Scheduling Via Stochastic Learning

Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference

PrefCache: Edge Cache Admission with User Preference Learning for Video Content Distribution

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition