Abstract: Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing the architectures. Our key idea is to assign the latent frames with progressively increasing noise levels rather than a single noise level, which allows for fine-grained condition among the latents and large overlaps between the attention windows. Such progressive video denoising allows our models to autoregressively generate video frames without quality degradation or abrupt scene changes. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Videos from this paper are available at https://desaixie.github.io/pa-vdm/.

What problem does this paper attempt to address?

This paper addresses the quality and continuity problems that existing video diffusion models face when generating long-form videos. Specifically:

1. **Limitations of Generating Short Clips**: Although current state-of-the-art video diffusion models can generate high-quality short clips (usually around 10 seconds, or 240 frames), computational limitations during training prevent them from generating longer videos.
2. **Quality Degradation and Sudden Scene Changes**: When existing methods extend video length autoregressively, the generated video tends to degrade in quality or change scene abruptly, especially at longer durations.

To address these problems, the authors propose **Progressive Autoregressive Video Diffusion Models (PA-VDM)**. The core innovation is progressive denoising: the latent frames are assigned progressively increasing noise levels, rather than all sharing a single noise level. This lets the model maintain high quality and smooth transitions when generating long-form videos, avoiding problems such as sudden scene changes and unnatural motion. Specific improvements include:

- **Progressive Noise Scheduling**: Because each frame is assigned a higher noise level than the one before it, later frames can better follow the patterns of earlier, cleaner frames, which yields smoother temporal transitions.
- **Large Overlapping Attention Windows**: Compared to other methods, PA-VDM achieves larger overlaps between attention windows without incurring additional computational cost, which improves the consistency and coherence of the generated videos.
- **No Need to Modify the Architecture**: The method can be implemented by fine-tuning pre-trained models, without changing the original model architecture.

With these improvements, PA-VDM maintains high quality when generating videos up to one minute long (1440 frames at 24 FPS) and outperforms existing baseline methods in both quantitative and qualitative evaluations.
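The progressive noise schedule described above can be illustrated with a toy sketch. It is not the authors' implementation: the function names `progressive_levels` and `generate` are hypothetical, the "denoiser" is a stand-in that only lowers each frame's noise level by one step, and the window is assumed to start already filled with progressively noised frames (any initialization stage is skipped). It tracks noise levels only, not actual latent tensors.

```python
from collections import deque
from fractions import Fraction


def progressive_levels(window):
    """Noise level per window position, rising from 1/window to 1 (pure noise)."""
    return [Fraction(i + 1, window) for i in range(window)]


def generate(num_frames, window):
    """Toy autoregressive loop: returns finished frame ids and final window levels."""
    # Each entry is (frame_id, noise_level); earlier frames are cleaner.
    frames = deque(zip(range(window), progressive_levels(window)))
    next_id = window
    finished = []
    step = Fraction(1, window)
    while len(finished) < num_frames:
        # Stand-in for one joint denoising step over the whole window:
        # every frame moves one noise level closer to clean.
        frames = deque((fid, lvl - step) for fid, lvl in frames)
        # The first frame is now fully denoised, so it leaves the window.
        fid, lvl = frames.popleft()
        assert lvl == 0
        finished.append(fid)
        # A new frame enters at the end as pure noise.
        frames.append((next_id, Fraction(1)))
        next_id += 1
    return finished, [lvl for _, lvl in frames]


if __name__ == "__main__":
    done, levels = generate(num_frames=6, window=4)
    print(done)
    print([str(lvl) for lvl in levels])
```

In this sketch the window returns to the same progressive schedule after every step, so consecutive windows share all but one frame. That is the sense in which the overlap between windows is large, and the loop can run for as many frames as desired.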

Progressive Autoregressive Video Diffusion Models