Abstract: Recent advances in auto-regressive large language models (LLMs) have shown their potential in generating high-quality text, inspiring researchers to apply them to image and video generation. This paper explores the application of LLMs to video continuation, a task essential for building world models and predicting future frames. In this paper, we tackle challenges including preventing degeneration in long-term frame generation and enhancing the quality of generated images. We design a scheme named ARCON, which involves training our model to alternately generate semantic tokens and RGB tokens, enabling the LLM to explicitly learn and predict the high-level structural information of the video. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance the visual quality of the generated videos. Quantitative and qualitative experiments in autonomous driving scenarios demonstrate our model can consistently generate long videos.

What problem does this paper attempt to address?

This paper addresses several key challenges in video continuation:

1. **Preventing degeneration in long-term frame generation**: During long-term video generation, the model may get stuck in static results or repetitive patterns, so the generated content lacks diversity and realism. The authors propose the ARCON scheme, which alleviates this problem by alternately generating semantic tokens and RGB tokens.
2. **Improving the quality of generated images**: Traditional autoregressive models can produce artifacts such as blurring when asked to generate high-quality images. The authors therefore introduce an optical-flow-based texture stitching method to enhance the visual quality of the generated video.
3. **Constructing a world model and predicting future frames**: Video continuation is crucial for constructing a world model and predicting future scenes. The authors aim to improve how autoregressive large language models (LLMs) are applied to this task so that the model can better understand and predict dynamic changes in the video.

### Main contributions

- A video continuation model based on a visual tokenizer and an LLM architecture, with potential emergent capabilities.
- The use of semantic tokens to improve the model's ability to generate longer videos and to enhance temporal consistency. The generated semantic maps correspond well to the RGB images.
- Extensive experiments showing that the model can generate high-quality long-term videos in different autonomous driving scenarios.

### Method overview

The core idea of the ARCON model is to encode video frames into discrete tokens (both RGB tokens and semantic tokens) and to generate these tokens alternately in an autoregressive manner. The specific steps are as follows:

1. **Image tokenizer**: Use the MAGVIT-v2 tokenizer to encode RGB images and semantic maps into discrete tokens.
2. **Autoregressive model**: Train a large autoregressive model based on the LLaMA architecture so that it can predict the tokens of subsequent frames from the previous frames.
3. **Image decoder**: Decode the generated discrete tokens back into images and improve the quality of the generated images through an optical-flow-guided method.

Through these methods, the ARCON model can generate high-quality and creative long-term videos while maintaining the structural information of the video.
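The alternating generation described in the method overview can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The function names (`interleave_frames`, `continue_video`), the choice to place semantic tokens before RGB tokens within each frame, the fixed number of tokens per map, and the toy `predict_next` callable standing in for the trained LLaMA-style model are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def interleave_frames(
    semantic: Sequence[Sequence[Token]],
    rgb: Sequence[Sequence[Token]],
) -> List[Token]:
    """Flatten per-frame token blocks into one sequence: sem_0, rgb_0, sem_1, rgb_1, ...

    The semantic-before-RGB order is an assumption of this sketch.
    """
    if len(semantic) != len(rgb):
        raise ValueError("need one semantic map per RGB frame")
    sequence: List[Token] = []
    for sem_block, rgb_block in zip(semantic, rgb):
        sequence.extend(sem_block)
        sequence.extend(rgb_block)
    return sequence


def continue_video(
    context: Sequence[Token],
    predict_next: Callable[[Sequence[Token]], Token],
    tokens_per_map: int,
    num_new_frames: int,
) -> Tuple[List[List[Token]], List[List[Token]]]:
    """Roll out new frames token by token, alternating semantic and RGB blocks.

    `predict_next` is a hypothetical stand-in for the autoregressive model:
    it receives the full token history and returns the next token.
    """
    sequence = list(context)
    semantic_frames: List[List[Token]] = []
    rgb_frames: List[List[Token]] = []
    for _ in range(num_new_frames):
        # Each new frame is a semantic block followed by an RGB block.
        for bucket in (semantic_frames, rgb_frames):
            block: List[Token] = []
            for _ in range(tokens_per_map):
                token = predict_next(sequence)
                sequence.append(token)  # the model conditions on its own output
                block.append(token)
            bucket.append(block)
    return semantic_frames, rgb_frames


if __name__ == "__main__":
    # Two context frames with two tokens per map (toy sizes).
    context = interleave_frames([[1, 2], [3, 4]], [[5, 6], [7, 8]])

    # Toy predictor: depends only on history length, so the rollout is deterministic.
    def toy_predictor(history: Sequence[Token]) -> Token:
        return len(history) % 7

    sem, rgb = continue_video(context, toy_predictor, tokens_per_map=2, num_new_frames=2)
    print(sem, rgb)
```

In a real system the token blocks would come from the image tokenizer and be passed to the image decoder after generation; here they are plain integer lists so that the control flow of the alternation is easy to follow.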

Advancing Auto-Regressive Continuation for Video Frames

ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Scaling Autoregressive Video Models

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models

Progressive Autoregressive Video Diffusion Models

ControlAR: Controllable Image Generation with Autoregressive Models

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning

MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

Lifelong Learning of Video Diffusion Models From a Single Video Stream

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

Frame Interpolation with Consecutive Brownian Bridge Diffusion

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Multi-modal Auto-regressive Modeling via Visual Words