Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen
2024-11-25
Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: the model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at https://github.com/Dawn-LX/CausalCache-VDM
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses the inefficiency and redundant computation of existing autoregressive video diffusion models (VDMs) when generating long videos. To generate the next clip, such a model must recompute the conditional frames that overlap between adjacent clips. When the conditional frames are extended autoregressively to provide long-term context, the computational cost grows quadratically with the autoregression step. This raises compute requirements and limits the flexibility and efficiency of these models in practical applications.

To solve these problems, the authors propose **Ca2-VDM** (Causal generation and Cache sharing Video Diffusion Model), an efficient autoregressive video diffusion model. The model improves efficiency in two ways:

1. **Causal Generation**:
   - Introduce unidirectional feature computation, so that the cache of conditional frames can be precomputed in previous autoregression steps and reused in subsequent steps, eliminating redundant computation.
   - Use causal attention so that each frame depends only on itself and the frames before it. Under bidirectional attention, the features of earlier frames would change whenever new frames are appended, which forces recomputation.
2. **Cache Sharing**:
   - Share the cache across all denoising steps to avoid a huge cache storage cost.
   - Manage the temporal KV-cache with a queue structure, combined with cyclic temporal positional embeddings (Cyclic-TPEs) to support this cache mechanism.

With these changes, Ca2-VDM significantly improves generation speed while remaining comparable to existing state-of-the-art models in quantitative and qualitative results on multiple public datasets.
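The queue-managed cache and the quadratic-versus-linear cost described above can be illustrated with a small sketch. It is not the authors' implementation: the class `TemporalKVCache`, the function `frames_processed`, and the modulo rule used for the cyclic position index are hypothetical names and simplifications, and the cost model counts only how many frames are passed through the network, not the attention cost itself.

```python
from collections import deque


class TemporalKVCache:
    """Toy FIFO cache of per-frame (key, value) entries.

    Hypothetical illustration only: the real cache stores attention
    keys/values per layer; here each entry is an opaque placeholder.
    """

    def __init__(self, max_frames):
        self.max_frames = max_frames
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.queue = deque(maxlen=max_frames)
        self.frames_seen = 0

    def append(self, key, value):
        # Assumption: a cyclic position index obtained by wrapping the
        # running frame counter around the cache length.
        position = self.frames_seen % self.max_frames
        self.queue.append((position, key, value))
        self.frames_seen += 1
        return position

    def positions(self):
        return [position for position, _, _ in self.queue]


def frames_processed(num_steps, clip_len, use_cache):
    """Count frames fed through the model over all autoregression steps.

    Each step generates `clip_len` new frames conditioned on every frame
    generated so far (the extended-context setting).
    """
    total = 0
    for step in range(num_steps):
        context = step * clip_len  # conditional frames available at this step
        if use_cache:
            total += clip_len  # cached context is reused, not recomputed
        else:
            total += context + clip_len  # context is recomputed every step
    return total


if __name__ == "__main__":
    cache = TemporalKVCache(max_frames=3)
    for frame in range(5):
        cache.append(key=f"k{frame}", value=f"v{frame}")
    print(cache.positions())  # [2, 0, 1]
    print(frames_processed(4, 8, use_cache=False))  # 80
    print(frames_processed(4, 8, use_cache=True))   # 32
```

Without the cache the count grows as `clip_len * n * (n + 1) / 2` over `n` steps, while with the cache it grows as `clip_len * n`, which mirrors the quadratic-versus-linear contrast stated in the summary.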
### Formula Representation

- **Causal Attention Calculation Formula**:
  \[ \text{CausalAttn}(Q, K, V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{C'}} + M\right)V \]
  where \(M\in\mathbb{R}^{L\times L}\) is the causal attention mask that leaves only the lower triangle open: \(M_{i,j}=-\infty\) if \(i < j\), and \(0\) otherwise.
- **Simplified Objective Function**:
  \[ L_{\text{simple}}(\theta)=\mathbb{E}_{z,\epsilon,t}\left[\left\|\epsilon_{\theta}(z_{t},t)-\epsilon\right\|_{2}^{2}\right],\quad\epsilon\sim\mathcal{N}(0,I) \]
- **Enhanced Objective Function**:
  \[ L_{\text{simple}}(\theta)=\mathbb{E}_{z,\epsilon,t}\left[\left\|\left(\epsilon_{\theta}([z_{0}^{1:P},z_{t}^{P+1:L}],t)-\epsilon\right)\odot m\right\|_{2}^{2}\right] \]
  where \(z_{0}^{1:P}\) denotes the \(P\) clean conditional (prefix) frames, \(z_{t}^{P+1:L}\) the remaining frames noised at step \(t\), \([\cdot,\cdot]\) concatenation along the time axis, \(t\) the timestep vector, and \(m\in\{0,1\}^{N}\) the loss mask.

Through these techniques, Ca2-VDM achieves efficient and flexible video generation and is especially suited to long-duration or real-time video generation tasks.
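The causal attention formula and the masked objective above can be checked numerically with a minimal pure-Python sketch. It is an illustration under simplifying assumptions, not the paper's code: a single attention head, one token per frame, \(C'\) taken to be the feature dimension of `Q`, and a loss mask with one scalar entry per frame. The function names are hypothetical.

```python
import math


def causal_mask(length):
    # M[i][j] = -inf when i < j (future position), 0 otherwise.
    return [[0.0 if j <= i else -math.inf for j in range(length)]
            for i in range(length)]


def softmax(row):
    # The diagonal is never masked, so max(row) is always finite.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attn(Q, K, V):
    """Softmax(Q K^T / sqrt(C') + M) V for lists of row vectors."""
    length, channels = len(Q), len(Q[0])
    M = causal_mask(length)
    out = []
    for i in range(length):
        scores = [
            sum(Q[i][c] * K[j][c] for c in range(channels)) / math.sqrt(channels)
            + M[i][j]
            for j in range(length)
        ]
        weights = softmax(scores)
        out.append([
            sum(weights[j] * V[j][d] for j in range(length))
            for d in range(len(V[0]))
        ])
    return out


def masked_squared_error(pred_noise, true_noise, loss_mask):
    """Sum of squared errors, counted only where the mask entry is 1."""
    return sum(m * (p - e) ** 2
               for p, e, m in zip(pred_noise, true_noise, loss_mask))


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    out = causal_attn(Q, K, V)
    print(out[0])  # [1.0, 2.0]: the first frame attends only to itself

    # Prefix frames (mask 0) contribute nothing to the loss.
    print(masked_squared_error([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [0, 0, 1]))  # 9.0
```

Because row \(i\) of the output depends only on rows \(1..i\) of `K` and `V`, appending later frames leaves earlier outputs unchanged, which is the property that makes a precomputed key/value cache reusable.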