EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang

2024-07-05

Abstract: This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block. It is used to capture temporal dynamics, thereby ensuring the production of consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate video with different styles. It can also generate videos with different frame rates and resolutions during both training and inference phases, suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long duration videos. Currently, EasyAnimate exhibits the proficiency to generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.

Computer Vision and Pattern Recognition, Computation and Language, Multimedia

What problem does this paper attempt to address?

The paper proposes EasyAnimate, a video generation method built on the Transformer architecture. The authors extend the Diffusion Transformer (DiT) framework, originally used for 2D image synthesis, to handle the added complexity of 3D video generation. To do this they introduce the Hybrid Motion Module, which combines temporal attention and global attention so that generated frames stay coherent and motion transitions stay smooth.

The paper also introduces Slice VAE, a technique that compresses the temporal axis. It limits the growth of GPU memory consumption as video length increases, which makes long videos practical to generate. EasyAnimate can currently generate videos of up to 144 frames, supports different frame rates and resolutions, and comes with a complete ecosystem covering data preprocessing, VAE training, DiT model training (both the baseline model and the LoRA model), and end-to-end video inference.

The main contributions of this research are:
1. Introducing EasyAnimate, an efficient Transformer-based video generation method.
2. Exploring temporal information in video generation through the Hybrid Motion Module.
3. Introducing Slice VAE for effective compression of the time dimension, which reduces memory usage and supports long video generation.

The paper also discusses challenges in existing video generation models, such as poor quality, limited video length, and unnatural motion, and compares EasyAnimate with related work such as Video VAE, MagViT, and Sora. Training proceeds in multiple stages, covering both the video VAE and the video Diffusion Transformer, to progressively improve the model's performance and its representation of motion.
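The memory argument behind slicing the temporal axis, as described above, can be illustrated with a small sketch. This is not the paper's Slice VAE: the learned encoder is replaced here by a toy function that averages consecutive frames, and the names `toy_temporal_encode` and `encode_in_slices`, the slice length of 8, and the compression ratio of 4 are all hypothetical values chosen for illustration. Only the 144-frame clip length comes from the text.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # a toy "frame": a flat list of pixel values


def toy_temporal_encode(frames: Sequence[Frame], ratio: int) -> List[Frame]:
    """Hypothetical stand-in for a learned encoder.

    Averages every `ratio` consecutive frames into one latent frame.
    """
    latents = []
    for start in range(0, len(frames), ratio):
        group = frames[start:start + ratio]
        # Element-wise mean over the frames in this group.
        latents.append([sum(px) / len(group) for px in zip(*group)])
    return latents


def encode_in_slices(
    frames: Sequence[Frame], slice_len: int, ratio: int
) -> Tuple[List[Frame], int]:
    """Encode a long clip one temporal slice at a time.

    Returns the concatenated latents and the largest number of frames
    that had to be processed at once.
    """
    if slice_len % ratio != 0:
        raise ValueError("slice_len must be a multiple of ratio")
    latents: List[Frame] = []
    peak_frames = 0
    for start in range(0, len(frames), slice_len):
        chunk = frames[start:start + slice_len]
        peak_frames = max(peak_frames, len(chunk))
        latents.extend(toy_temporal_encode(chunk, ratio))
    return latents, peak_frames


# 144 frames of 2 "pixels" each; frame t holds the value t in both pixels.
video = [[float(t), float(t)] for t in range(144)]
latents, peak_frames = encode_in_slices(video, slice_len=8, ratio=4)
print(len(latents), peak_frames)
```

In this toy setting the encoder only ever sees 8 frames at a time, regardless of clip length, while the output is identical to encoding the whole clip at once. A real learned encoder with temporal context would need some way to keep slice boundaries consistent, which this sketch does not attempt to model.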

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

Adaptive Caching for Faster Video Generation with Diffusion Transformers

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

AtomoVideo: High Fidelity Image-to-Video Generation

Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

AnimateAnything: Consistent and Controllable Animation for Video Generation

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

DiVE: DiT-based Video Generation with Enhanced Control

VEnhancer: Generative Space-Time Enhancement for Video Generation

LoopAnimate: Loopable Salient Object Animation

Anchored Diffusion for Video Face Reenactment

Controllable Longer Image Animation with Diffusion Models

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction