VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
2024-06-05
Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to use large language models (LLMs) for high-quality video generation, in particular the generation of high-fidelity video in a zero-shot setting. It introduces VideoPoet, a model that synthesizes video, with matching audio, from a variety of conditioning signals, including images, videos, text, and audio. VideoPoet uses a decoder-only Transformer architecture and is trained in two stages: pretraining and task-specific adaptation. Pretraining uses a mixture of multimodal generative objectives; in the adaptation stage, the pretrained model is fine-tuned to improve generation quality on specific tasks or to perform new tasks.

The main contributions of the paper are:
1. A method for training an LLM specifically for video generation, using tokenized video data that includes both text-paired and unpaired videos.
2. A video super-resolution method that increases spatial resolution in the latent token space, using a bidirectional Transformer with efficient windowed local attention.
3. Evaluations and demonstrations of VideoPoet's competitive, state-of-the-art performance, particularly in generating realistic videos with dynamic motion.

Overall, the study explores the potential of LLMs for video generation, especially zero-shot video generation, as an alternative to the currently dominant diffusion-based approaches.
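
To make the windowed local attention named in the second contribution more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a 1D token sequence split into non-overlapping windows, and it uses the tokens themselves as queries, keys, and values; the function names, the 1D layout, and the `window` parameter are illustrative assumptions. The attention is bidirectional in the sense that no causal mask is applied inside a window.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_self_attention(tokens, window):
    """Bidirectional self-attention restricted to non-overlapping windows.

    tokens: list of equal-length float vectors (hypothetical latent token
            embeddings; a real model would apply learned projections first).
    window: number of consecutive tokens per window (assumed hyperparameter).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        for query in block:
            # No causal mask: each token attends to every token in its window.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in block]
            weights = softmax(scores)
            # Weighted average of the window's tokens, one dimension at a time.
            mixed = [sum(w * value[d] for w, value in zip(weights, block))
                     for d in range(dim)]
            output.append(mixed)
    return output
```

Because each token attends only within its own window, the cost for n tokens and window size w grows as n·w rather than n², which is the usual motivation for local attention on long token sequences.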