VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang

2024-06-05

Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to use large language models (LLMs) for high-quality video generation, in particular generating high-fidelity video in a zero-shot setting. It introduces VideoPoet, a model that synthesizes video from a variety of conditioning signals, including images, videos, text, and audio. VideoPoet uses a decoder-only Transformer architecture and is trained in two stages: pretraining and task-specific adaptation. Pretraining uses a mixture of multimodal generative objectives; in the task-specific adaptation stage, the model is fine-tuned to improve generation quality on specific tasks or to perform new tasks.

The main contributions of the paper are:

1. A method for training a large language model specifically for video generation, using tokenized video data that includes both paired and unpaired videos.
2. A video super-resolution method that increases spatial resolution in the latent token space using a bidirectional Transformer with an efficient windowed local attention mechanism.
3. Evaluations and demonstrations of VideoPoet's competitive and state-of-the-art performance in generating videos with realistic, dynamic motion.

Overall, the study explores the potential of LLMs for video generation, especially zero-shot video generation, in contrast to the currently dominant diffusion-based approaches.
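
As an illustration of the windowed local attention named in the second contribution, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes a flat 1D token sequence split into non-overlapping windows, with full bidirectional attention inside each window; the windowing scheme VideoPoet actually uses over its latent tokens is not described in this summary. The function names `local_window_mask` and `windowed_attention` are hypothetical.

```python
import numpy as np

def local_window_mask(num_tokens: int, window: int) -> np.ndarray:
    """Return a boolean (num_tokens, num_tokens) mask.

    Entry [i, j] is True when tokens i and j fall in the same
    non-overlapping window (an assumed, simplified windowing scheme).
    """
    idx = np.arange(num_tokens)
    return (idx[:, None] // window) == (idx[None, :] // window)

def windowed_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                       window: int) -> np.ndarray:
    """Single-head bidirectional attention restricted to local windows.

    q, k, v have shape (num_tokens, dim). No causal mask is applied,
    so each token attends to every token in its own window.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Block attention across window boundaries.
    scores = np.where(local_window_mask(n, window), scores, -np.inf)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: 6 tokens, windows of 3, so two independent attention blocks.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = windowed_attention(q, k, v, window=3)
```

Restricting attention to windows reduces the cost from quadratic in the full sequence length to quadratic in the window size per window, which is why this family of techniques is attractive for the long token sequences that high-resolution video produces.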

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

VideoLLM: Modeling Video Sequence with Large Language Models

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination

VideoGPT: Video Generation using VQ-VAE and Transformers

HunyuanVideo: A Systematic Framework For Large Video Generative Models

A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Scaling Up Video Summarization Pretraining with Large Language Models

Distilling Vision-Language Models on Millions of Videos

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

Movie Gen: A Cast of Media Foundation Models

Video as the New Language for Real-World Decision Making

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages