Abstract: Text-to-video models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, efficiently generating long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers (DiT) with autoregressive (AR) models for long video generation, using the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens, bridging the AR and DiT models and balancing learning complexity against information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To improve tolerance to the noise introduced by AR inference, the DiT model is trained with coarser visual latent tokens together with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts.
See demos of ARLON at http://aka.ms/arlon.
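
The abstract describes a latent VQ-VAE that turns continuous DiT latents into compact discrete visual tokens. As a rough illustration of that general idea only, the sketch below shows plain nearest-neighbour vector quantization in pure Python. The codebook values, vector size, and the function names `quantize` and `dequantize` are hypothetical and chosen for illustration; the abstract does not give ARLON's codebook size, latent dimensions, or training procedure, so this should not be read as the authors' implementation.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices, codebook):
    """Look up the codebook vector for each discrete token index."""
    return [codebook[k] for k in indices]


# Toy, hand-picked codebook with three 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Toy continuous latents standing in for encoder outputs.
latents = [[0.9, 1.2], [-0.1, 0.1], [-0.8, 0.7]]

tokens = quantize(latents, codebook)       # discrete visual tokens
coarse = dequantize(tokens, codebook)      # coarse reconstruction of the latents
print(tokens)
```

In this toy setting, the discrete indices play the role of the tokens an AR model would predict, and the looked-up codebook vectors play the role of the coarse latents that could then condition a diffusion model.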

LTX-Video: Realtime Video Latent Diffusion

Latte: Latent Diffusion Transformer for Video Generation

Photorealistic Video Generation with Diffusion Models

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

REDUCIO! Generating 1024×1024 Video Within 16 Seconds Using Extremely Compressed Motion Latents

VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Lumiere: A Space-Time Diffusion Model for Video Generation

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

Video Diffusion Models

Video Probabilistic Diffusion Models in Projected Latent Space