CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang

2024-10-08

Abstract: We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels. Previous video generation models often had limited movement and short durations, and it was difficult for them to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, improving both compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. Third, by employing progressive training and a multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration videos of different shapes characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics and human evaluations. The model weights of the 3D Causal VAE, the video caption model, and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
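To give a sense of why compressing along both the spatial and temporal dimensions matters, the following is a minimal sketch of the arithmetic behind the figures quoted in the abstract (10 seconds, 16 fps, 768 × 1360 pixels). The compression factors used here (4 along time, 8 along each spatial axis) are illustrative assumptions and are not stated in the abstract; the helper names `raw_video_shape`, `compressed_shape`, and `num_elements` are likewise hypothetical.

```python
from math import ceil


def raw_video_shape(seconds, fps, height, width, channels=3):
    """Shape (frames, height, width, channels) of an uncompressed RGB clip."""
    frames = seconds * fps
    return (frames, height, width, channels)


def compressed_shape(shape, t_factor, s_factor):
    """Latent grid (frames, height, width) after downsampling.

    t_factor and s_factor are ASSUMED compression factors for time and
    space; they are placeholders, not values taken from the abstract.
    """
    frames, height, width, _ = shape
    return (ceil(frames / t_factor), ceil(height / s_factor), ceil(width / s_factor))


def num_elements(shape):
    """Product of all dimensions in a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# Figures from the abstract: 10 s at 16 fps, 768 x 1360 pixels.
raw = raw_video_shape(seconds=10, fps=16, height=768, width=1360)

# Assumed factors for illustration only: 4x temporal, 8x per spatial axis.
latent = compressed_shape(raw, t_factor=4, s_factor=8)

print("raw shape:", raw)
print("latent grid:", latent)
print("positions per latent cell:", num_elements(raw[:3]) // num_elements(latent))
```

Under these assumed factors, each latent position stands in for 4 × 8 × 8 = 256 pixel positions, which is what makes modelling a clip of this length tractable for a transformer operating on the latent grid.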

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address several key issues in existing text-to-video generation models:

1. **Coherence and Dynamics of Generated Videos**: Existing video generation models often struggle to produce long videos with coherent narratives and rich motion.
2. **Video Quality**: Improving the quality and stability of generated videos through better video compression.
3. **Modal Fusion**: Strengthening the alignment between text and video so that generated videos better reflect the content described in the text.

To address these issues, the research team proposes CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer. By introducing a 3D variational autoencoder, an expert transformer, and a series of training strategies, the model can generate high-quality, long-duration videos while preserving coherence and motion. Experimental results show that CogVideoX performs strongly across multiple evaluation metrics, surpassing currently available public models.

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

ControlVideo: Training-free Controllable Text-to-Video Generation

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Compositional Video Generation as Flow Equalization