FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, Kwan-Yee K. Wong
2024-10-25
Abstract: In this paper, we present FasterCache, a novel training-free strategy designed to accelerate the inference of video diffusion models with high-quality generation. By analyzing existing cache-based methods, we observe that directly reusing adjacent-step features degrades video quality due to the loss of subtle variations. We further perform a pioneering investigation of the acceleration potential of classifier-free guidance (CFG) and reveal significant redundancy between conditional and unconditional features within the same timestep. Capitalizing on these observations, we introduce FasterCache to substantially accelerate diffusion-based video generation. Our key contributions include a dynamic feature reuse strategy that preserves both feature distinction and temporal continuity, and CFG-Cache which optimizes the reuse of conditional and unconditional outputs to further enhance inference speed without compromising video quality. We empirically evaluate FasterCache on recent video diffusion models. Experimental results show that FasterCache can significantly accelerate video generation (e.g., 1.67× speedup on Vchitect-2.0) while keeping video quality comparable to the baseline, and consistently outperform existing methods in both inference speed and video quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to speed up inference in video diffusion models without degrading the quality of the generated videos. Existing cache-based acceleration methods reduce computational cost, but directly reusing features from adjacent steps degrades video quality, particularly the preservation of fine detail. In addition, classifier-free guidance (CFG) significantly improves the quality of synthesized images and videos but adds computation and lengthens inference time. To tackle these issues, the authors propose FasterCache, a training-free strategy that accelerates inference through dynamic feature reuse and CFG-Cache.

The key contributions of FasterCache include:

1. **Dynamic Feature Reuse Strategy**: This strategy dynamically adjusts the features that are reused across time steps so that the differences between adjacent steps, and the temporal continuity across them, are maintained. Inference is accelerated while subtle changes and details in the generated videos are preserved.
2. **CFG-Cache**: This technique stores the residuals between conditional and unconditional outputs and dynamically enhances the high- and low-frequency components of these residuals before reuse, further speeding up inference while maintaining video details.

Experimental results show that FasterCache achieves significant acceleration across multiple video diffusion models while maintaining or even improving video quality compared to baseline models. For example, on the Vchitect-2.0 model, FasterCache achieves a 1.67x speedup with performance comparable to the baseline (VBench: Baseline 80.80% → FasterCache 80.84%).
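The dynamic feature reuse idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `DynamicFeatureCache`, the helper `run_layer`, the linear extrapolation, and the weight schedule `step / total_steps` are all assumptions chosen only to show how a reused feature can carry the trend between cached steps instead of being copied verbatim.

```python
from typing import Callable, List, Optional

Vector = List[float]


class DynamicFeatureCache:
    """Hypothetical cache holding the two most recently computed features."""

    def __init__(self) -> None:
        self.prev: Optional[Vector] = None  # older computed feature
        self.last: Optional[Vector] = None  # newest computed feature

    def update(self, feature: Vector) -> None:
        # Shift the newest feature into the "older" slot and store the new one.
        self.prev, self.last = self.last, list(feature)

    def reuse(self, weight: float) -> Vector:
        # Plain reuse would return self.last unchanged. Here the difference
        # between the two cached features is added back, scaled by `weight`,
        # so the reused feature keeps some step-to-step variation.
        if self.last is None:
            raise ValueError("cache is empty")
        if self.prev is None:
            return list(self.last)
        return [b + weight * (b - a) for a, b in zip(self.prev, self.last)]


def run_layer(step: int, compute: Callable[[int], Vector],
              cache: DynamicFeatureCache, total_steps: int,
              interval: int = 2) -> Vector:
    """Compute the feature every `interval` steps; otherwise reuse from cache."""
    if step % interval == 0 or cache.last is None:
        feature = compute(step)
        cache.update(feature)
        return feature
    weight = step / total_steps  # assumed schedule, not taken from the paper
    return cache.reuse(weight)
```

With `interval=2`, roughly half of the layer evaluations are skipped in this toy setup; the real speedup depends on which layers are cached and how often.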
In summary, the paper analyzes the limitations of existing cache-based acceleration methods and the redundancy in CFG, and builds its acceleration strategy on that analysis. The strategy improves inference efficiency while preserving the quality of the generated videos, which gives it practical value.
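The CFG redundancy mentioned above can also be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's method: the residual is taken as unconditional minus conditional output, the frequency split uses a 3-tap moving average rather than a Fourier transform, and the names `low_pass`, `enhance`, `approx_uncond`, and the gain parameters are hypothetical.

```python
from typing import List

Vector = List[float]


def low_pass(signal: Vector) -> Vector:
    """3-tap moving average with edge replication (a crude low-pass filter)."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append((left + signal[i] + right) / 3.0)
    return out


def cfg_residual(cond: Vector, uncond: Vector) -> Vector:
    """Residual cached at a step where both branches are computed (assumed sign)."""
    return [u - c for c, u in zip(cond, uncond)]


def enhance(residual: Vector, low_gain: float, high_gain: float) -> Vector:
    """Scale the low- and high-frequency parts of the residual separately."""
    low = low_pass(residual)
    return [low_gain * l + high_gain * (r - l) for r, l in zip(residual, low)]


def approx_uncond(cond: Vector, cached_residual: Vector,
                  low_gain: float = 1.0, high_gain: float = 1.0) -> Vector:
    """Estimate the unconditional output without running that branch."""
    boosted = enhance(cached_residual, low_gain, high_gain)
    return [c + b for c, b in zip(cond, boosted)]


def guided(cond: Vector, uncond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance combination."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

On steps that reuse the cache, only the conditional branch is evaluated and `approx_uncond` stands in for the unconditional one, which is where the saving comes from. How the gains change over the sampling schedule is left open here.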