Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: the model must re-compute all the conditional frames that overlap between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at https://github.com/Dawn-LX/CausalCache-VDM

### What problems does this paper attempt to solve?

This paper addresses the inefficiency and redundant computation of existing video diffusion models (VDMs) when generating long videos. Current autoregressive VDMs must recompute the conditional frames that overlap between adjacent clips every time they generate a new clip. When the conditional frames are extended autoregressively to provide long-term context, the computational cost grows quadratically with the autoregression step. This raises computational requirements and limits the flexibility and efficiency of these models in practical applications.

To solve these problems, the authors propose **Ca2-VDM**, an efficient autoregressive video diffusion model with Causal generation and Cache sharing. The model improves efficiency in two ways:

1. **Causal Generation**:
   - Introduce unidirectional feature computation so that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computation.
   - Use causal attention so that each generated frame depends only on its preceding frames, avoiding the repeated computation that bidirectional attention entails.
2. **Cache Sharing**:
   - Share the cache across all denoising steps to avoid a huge cache storage cost.
   - Manage the temporal KV-cache with a queue structure, combined with cyclic temporal positional embeddings (Cyclic-TPEs) to support this cache mechanism.

With these improvements, Ca2-VDM significantly improves generation speed while showing quantitative and qualitative performance comparable to existing state-of-the-art models in experiments on multiple public datasets.
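The quadratic-versus-linear argument and the queue-managed cache can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the names `frames_processed`, `FrameKVQueue`, and `cyclic_position` are hypothetical, the cost model only counts how many frames pass through the network per autoregression step (it ignores denoising steps and attention cost), and the modulo wrap-around is just one simple way a cyclic position index could work, which may differ from the paper's Cyclic-TPE scheme.

```python
from collections import deque


def frames_processed(num_steps: int, clip_len: int, use_cache: bool) -> int:
    """Count frames fed through the network over `num_steps` autoregression steps.

    Assumed cost model: at step k, the context is every frame generated so far
    (k * clip_len frames). Without a cache the context is recomputed each step;
    with a cache only the clip_len new frames are computed.
    """
    total = 0
    for step in range(num_steps):
        context = step * clip_len
        total += clip_len if use_cache else context + clip_len
    return total


class FrameKVQueue:
    """Hypothetical FIFO cache holding one (key, value) entry per frame."""

    def __init__(self, max_frames: int):
        # deque(maxlen=...) silently discards the oldest entries when full.
        self.entries = deque(maxlen=max_frames)

    def extend(self, frame_entries):
        """Append entries for newly generated frames, evicting the oldest."""
        self.entries.extend(frame_entries)

    def __len__(self):
        return len(self.entries)


def cyclic_position(frame_index: int, num_positions: int) -> int:
    """Map an unbounded frame index onto a fixed set of position slots."""
    return frame_index % num_positions
```

Under this cost model, `frames_processed(4, 8, use_cache=False)` gives 80 while `frames_processed(4, 8, use_cache=True)` gives 32: the uncached total grows quadratically with the number of steps, the cached total linearly.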
### Formula Representation

- **Causal Attention Calculation Formula**:
  \[ \text{CausalAttn}(Q, K, V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{C'}} + M\right)V \]
  where \(M\in\mathbb{R}^{L\times L}\) is a lower triangular attention masking matrix, with \(M_{i,j}=-\infty\) if \(i < j\) and \(0\) otherwise.
- **Simplified Objective Function**:
  \[ L_{\text{simple}}(\theta)=\mathbb{E}_{z,\epsilon,t}\left[\left\|\epsilon_{\theta}(z_{t},t)-\epsilon\right\|_{2}^{2}\right],\quad\epsilon\sim\mathcal{N}(0,I) \]
- **Enhanced Objective Function**:
  \[ \tilde{L}_{\text{simple}}(\theta)=\mathbb{E}_{z,\epsilon,t}\left[\left\|\left(\epsilon_{\theta}([z_{0}^{0:P},z_{t}^{P:L}],\mathbf{t})-\epsilon\right)\odot m\right\|_{2}^{2}\right] \]
  where \([\cdot,\cdot]\) represents concatenation along the time axis, \(\mathbf{t}\) is the time-step vector, and \(m\in\{0,1\}^{N}\) is the loss mask.

Through these techniques, Ca2-VDM achieves efficient and flexible video generation, and is especially suitable for long-duration or real-time video generation tasks.
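The formulas above can be checked numerically with a minimal NumPy sketch. It is not the authors' code: it assumes single-head attention over per-frame feature vectors of shape `(L, C)`, uses `C` in place of \(C'\), and treats the loss mask as one 0/1 entry per frame. The function names `causal_attention` and `masked_noise_loss` are hypothetical.

```python
import numpy as np


def causal_attention(q, k, v):
    """Single-head causal attention; q, k, v have shape (L, C)."""
    L, C = q.shape
    scores = q @ k.T / np.sqrt(C)
    # M[i, j] = -inf where i < j (a future frame), 0 otherwise.
    idx = np.arange(L)
    mask = np.where(idx[:, None] < idx[None, :], -np.inf, 0.0)
    scores = scores + mask
    # Numerically stable softmax over the key axis; the diagonal is always
    # unmasked, so every row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def masked_noise_loss(pred_eps, eps, frame_mask):
    """Squared error between predicted and true noise, keeping only the
    frames where frame_mask == 1 (assumed: one mask entry per frame)."""
    diff = (pred_eps - eps) * frame_mask[:, None]
    return float(np.sum(diff ** 2))
```

Because row \(i\) of the masked score matrix only has finite entries for \(j \le i\), the outputs for the first \(P\) frames are identical whether attention runs over those \(P\) frames alone or over the full sequence. This is the property that lets features of conditional frames be computed once and reused in later autoregression steps.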

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Progressive Autoregressive Video Diffusion Models

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

MV-Diffusion: Motion-aware Video Diffusion Model

Improved Video VAE for Latent Video Diffusion Model

Efficiency-optimized Video Diffusion Models

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

GD-VDM: Generated Depth for better Diffusion-based Video Generation

Video Probabilistic Diffusion Models in Projected Latent Space

VIDM: Video Implicit Diffusion Models

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

Video Diffusion Models with Local-Global Context Guidance

Video Diffusion Models

SF-V: Single Forward Video Generation Model