Progressive Autoregressive Video Diffusion Models

Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou
2024-10-11
Abstract: Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing the architectures. Our key idea is to assign the latent frames with progressively increasing noise levels rather than a single noise level, which allows for fine-grained condition among the latents and large overlaps between the attention windows. Such progressive video denoising allows our models to autoregressively generate video frames without quality degradation or abrupt scene changes. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Videos from this paper are available at https://desaixie.github.io/pa-vdm/.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the quality and continuity problems that existing video diffusion models face when generating long-form videos. Specifically:

1. **Limitations of Generating Short Clips**: Current state-of-the-art video diffusion models can generate high-quality short clips (usually around 10 seconds or 240 frames), but computational limitations during training prevent them from generating longer videos.
2. **Quality Degradation and Sudden Scene Changes**: Existing methods that extend video length autoregressively tend to lose quality or introduce sudden scene changes, especially as the video gets longer.

To address these problems, the authors propose **Progressive Autoregressive Video Diffusion Models (PA-VDM)**. The core innovation is progressive denoising: the latent frames are assigned progressively increasing noise levels rather than a single shared noise level. This lets the model maintain high quality and smooth transitions when generating long-form videos, avoiding sudden scene changes and unnatural motion. Specific improvements include:

- **Progressive Noise Scheduling**: Because noise levels increase gradually from earlier to later frames, each later frame can follow the patterns of the less noisy frames before it, which gives smoother temporal transitions.
- **Large Overlapping Attention Windows**: Compared to other methods, PA-VDM can achieve larger overlaps between attention windows without incurring additional computational cost, which improves the consistency and coherence of the generated videos.
- **No Need to Modify the Architecture**: The method can be implemented by fine-tuning pre-trained models, without changing the original model architecture.

Through these improvements, PA-VDM maintains high quality when generating videos up to one minute long (1440 frames at 24 FPS) and outperforms existing baseline methods in both quantitative and qualitative evaluations.
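The progressive noise assignment and the autoregressive sliding window described above can be illustrated with a small bookkeeping sketch. It tracks only noise levels, not pixels or latents, and it rests on assumptions that the text here does not confirm: noise levels are spaced linearly across the window, the number of denoising steps equals the window size, the window starts already in its progressive state, and the window shifts by one frame per step. All function names are hypothetical, and the actual model call is replaced by a comment.

```python
from collections import deque


def progressive_noise_levels(num_frames: int, num_steps: int) -> list[int]:
    """Assign one noise level per latent frame, increasing from first to last.

    Assumption: linear spacing, so the last frame sits at pure noise
    (level == num_steps) and the first frame is one stride from clean.
    """
    return [(i + 1) * num_steps // num_frames for i in range(num_frames)]


def generate_frame_order(num_output_frames: int, window_size: int) -> list[int]:
    """Simulate the sliding window and return frame ids in emission order.

    Assumption: num_steps == window_size, so one joint denoising step
    lowers every frame's noise level by exactly 1.
    """
    levels = progressive_noise_levels(window_size, window_size)
    window = deque(zip(range(window_size), levels))  # (frame_id, noise_level)
    next_id = window_size
    emitted = []
    while len(emitted) < num_output_frames:
        # Placeholder for one joint denoising pass over the whole window;
        # a real system would call the video diffusion model here.
        window = deque((fid, lvl - 1) for fid, lvl in window)
        # The earliest frame has reached level 0 (clean): emit it.
        fid, lvl = window.popleft()
        assert lvl == 0
        emitted.append(fid)
        # Append a fresh frame at the maximum noise level.
        window.append((next_id, window_size))
        next_id += 1
    return emitted


def window_overlap(window_size: int, shift: int) -> float:
    """Fraction of frames shared by two consecutive attention windows."""
    return (window_size - shift) / window_size


if __name__ == "__main__":
    print(progressive_noise_levels(4, 8))
    print(generate_frame_order(6, 4))
    print(window_overlap(16, 1), window_overlap(16, 8))
```

Under these assumptions, every step emits one clean frame and admits one new noisy frame, so consecutive windows share all but one frame. A chunk-based scheme that replaces half of its window at a time shares only half of it, which is the contrast `window_overlap` is meant to show.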