Abstract: Text-to-video models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers with autoregressive models for long video generation, by integrating the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens, bridging the AR and DiT models and balancing the learning complexity and information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To enhance the tolerance capability of noise introduced from the AR inference, the DiT model is trained with coarser visual latent tokens incorporated with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts.
See demos of ARLON at http://aka.ms/arlon.
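
The abstract says a latent VQ-VAE turns the DiT model's latents into compact visual tokens. As a rough illustration of what "quantizing into tokens" means, the sketch below performs the nearest-codebook lookup used by vector quantization in general. It is not the authors' implementation: the function names, the two-dimensional toy codebook, and the sample latents are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each latent vector to the index of its nearest codebook entry.

    Returns the discrete token ids and the corresponding codebook vectors.
    """
    token_ids: List[int] = []
    quantized: List[Vector] = []
    for z in latents:
        # Nearest-neighbour search over the (hypothetical) codebook.
        idx = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        token_ids.append(idx)
        quantized.append(codebook[idx])
    return token_ids, quantized


# Toy data, chosen only for illustration.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.9, 1.2), (0.1, -0.2), (-0.8, 0.7)]

token_ids, quantized = quantize(latents, codebook)
print(token_ids)
```

In a setup like the one the abstract describes, the token ids would be the discrete units an autoregressive model predicts, and the looked-up codebook vectors would be the coarse latents passed on as guidance. How ARLON sizes its codebook or compresses the latents is not stated in the abstract and is not modeled here.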

What problem does this paper attempt to address?

The main problem this paper attempts to address is the generation of high-quality, dynamic, and temporally consistent long videos. Despite the rapid development of text-to-video (T2V) models in recent years, efficiently generating long videos with rich dynamic motion remains a significant challenge because of data and computational resource limitations. To this end, the paper proposes the ARLON framework, which improves long video generation by combining autoregressive (AR) models with diffusion Transformer (DiT) models.

### Main Problems:

1. **High Training Cost**: Especially for high-resolution videos, training is limited to short video clips per batch, resulting in insufficient dynamic motion.
2. **Complex and Time-Consuming Generation Process**: Denoising a video conditioned on text alone is inherently complex and time-consuming.
3. **Difficulty in Generating Long Videos**: Maintaining consistent motion and diverse content is particularly challenging when generating long videos.

### Solutions:

- **ARLON Framework**: Combines an autoregressive Transformer with a diffusion Transformer (DiT) model. The AR model provides coarse-grained spatial and long-range temporal information to guide the DiT model, which then generates high-quality, dynamic, and temporally consistent long videos.
- **Key Innovations**:
  1. **Latent Vector Quantized Variational Autoencoder (VQ-VAE)**: Compresses the input latent space of the DiT model into compact, highly quantized visual tokens, balancing learning complexity and information density.
  2. **Adaptive Norm-Based Semantic Injection Module**: Integrates the coarse-grained discrete visual units generated by the AR model into the DiT model, ensuring effective guidance during video generation.
  3. **Uncertainty Sampling Module**: Makes the DiT model more tolerant of noise introduced during AR inference by training it with coarser visual latent tokens and an uncertainty sampling strategy.

### Experimental Results:

- **Performance Evaluation**: ARLON significantly outperforms the baseline OpenSora-V1.2 on eight of the eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, and delivers competitive results on the remaining three while also accelerating generation.
- **Long Video Generation**: ARLON achieves state-of-the-art performance in long video generation tasks, surpassing other open-source models.

### Summary:

The ARLON framework addresses the challenge of generating high-quality, dynamic, and temporally consistent long videos by combining the strengths of autoregressive models and diffusion models, providing a new solution for the text-to-video generation field.
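
To make the adaptive norm-based semantic injection idea more concrete, here is a minimal sketch of generic adaptive normalization: a condition vector is projected to a per-channel scale and shift, which then modulate a normalized hidden state. This is an assumption-laden illustration of the general technique, not ARLON's actual module. The function names, the `1 + scale` parameterization, and the toy dimensions are hypothetical choices for this example.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(
    x: Sequence[float], weight: Sequence[Sequence[float]], bias: Sequence[float]
) -> List[float]:
    """Dense projection: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def adaptive_norm(
    hidden: Sequence[float],
    condition: Sequence[float],
    w_scale: Sequence[Sequence[float]],
    b_scale: Sequence[float],
    w_shift: Sequence[Sequence[float]],
    b_shift: Sequence[float],
) -> List[float]:
    """Modulate a normalized hidden state with condition-dependent scale and shift."""
    scale = linear(condition, w_scale, b_scale)
    shift = linear(condition, w_shift, b_shift)
    normed = layer_norm(hidden)
    # With zero scale and shift this reduces to plain layer normalization.
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


# Toy example: 3-channel hidden state, 2-dimensional condition embedding.
hidden = [1.0, 2.0, 3.0]
condition = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in range(3)]
zero_b = [0.0, 0.0, 0.0]

unconditioned = adaptive_norm(hidden, condition, zero_w, zero_b, zero_w, zero_b)
shifted = adaptive_norm(hidden, condition, zero_w, zero_b, zero_w, [1.0, 1.0, 1.0])
```

In this sketch the condition vector stands in for an embedding of the AR model's coarse visual tokens. The summary does not say how ARLON computes that embedding or where in the DiT blocks the modulation is applied, so neither is represented here.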

ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Progressive Autoregressive Video Diffusion Models

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Autoregressive Video Generation without Vector Quantization

VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

LoVA: Long-form Video-to-Audio Generation

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

LTX-Video: Realtime Video Latent Diffusion

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Advancing Auto-Regressive Continuation for Video Frames

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Anchored Diffusion for Video Face Reenactment

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations