EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang
2024-07-05
Abstract: This paper presents EasyAnimate, an advanced method for video generation that leverages the transformer architecture for high-performance outcomes. We extend the DiT framework, originally designed for 2D image synthesis, to accommodate the complexities of 3D video generation by incorporating a motion module block. This module captures temporal dynamics, ensuring consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate videos in different styles. It also supports different frame rates and resolutions during both training and inference, and is suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condensing the temporal axis that facilitates the generation of long-duration videos. Currently, EasyAnimate can generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing data pre-processing, VAE training, DiT model training (both the baseline model and the LoRA model), and end-to-end video inference. Code is available at https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.
Computer Vision and Pattern Recognition, Computation and Language, Multimedia
What problem does this paper attempt to address?
The paper proposes EasyAnimate, a video generation method built on the Transformer architecture. The authors extend the Diffusion Transformer (DiT) framework, originally used for 2D image synthesis, to handle the complexity of 3D video generation. They introduce the Hybrid Motion Module, which combines temporal attention and global attention to keep frames coherent and motion transitions smooth. The paper also introduces Slice VAE, a technique that compresses the temporal axis so that GPU memory consumption stays manageable as video length increases, which makes long videos feasible. EasyAnimate can generate videos of up to 144 frames at different resolutions, and it comes with a complete ecosystem covering data preprocessing, VAE training, DiT model training (both the baseline model and the LoRA model), and end-to-end video inference.

The main contributions of this research are:
1. Introducing EasyAnimate, an efficient video generation method based on Transformer.
2. Exploring temporal information in video generation through the Hybrid Motion Module.
3. Introducing Slice VAE for effective compression of the time dimension, reducing memory usage and supporting long video generation.

The paper also discusses challenges in existing video generation models, such as poor quality, limited video length, and unnatural motion, and compares EasyAnimate with related works such as Video VAE, MagViT, and Sora. Training proceeds in multiple stages for both the video VAE and the video Diffusion Transformer, progressively improving the model's performance and its representation of motion.
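To make the memory argument behind temporal slicing concrete, here is a minimal sketch of the bookkeeping involved. It is not the paper's Slice VAE implementation: the function names, the slice length of 8 frames, and the temporal compression factor of 4 are all hypothetical values chosen for illustration, and the sketch assumes each slice is compressed independently. Only the 144-frame video length comes from the text above.

```python
from math import ceil


def temporal_slices(num_frames, slice_len):
    """Split frame indices [0, num_frames) into consecutive (start, end) slices.

    Hypothetical helper: slice_len is an assumed parameter, not a value
    taken from the paper.
    """
    if num_frames <= 0 or slice_len <= 0:
        raise ValueError("num_frames and slice_len must be positive")
    return [
        (start, min(start + slice_len, num_frames))
        for start in range(0, num_frames, slice_len)
    ]


def latent_frames(num_frames, slice_len, compression):
    """Count latent frames if every slice is compressed along time on its own.

    compression is an assumed temporal downsampling factor; a partial
    slice still yields at least one latent frame (ceiling division).
    """
    if compression <= 0:
        raise ValueError("compression must be positive")
    return sum(
        ceil((end - start) / compression)
        for start, end in temporal_slices(num_frames, slice_len)
    )


def peak_frames_per_call(num_frames, slice_len):
    """Largest number of frames handled by a single encode call."""
    return max(end - start for start, end in temporal_slices(num_frames, slice_len))


if __name__ == "__main__":
    # 144 frames is the length quoted in the text; 8 and 4 are assumptions.
    slices = temporal_slices(144, 8)
    print(len(slices), "slices, first:", slices[0], "last:", slices[-1])
    print("latent frames:", latent_frames(144, 8, 4))
    print("peak frames per encode call:", peak_frames_per_call(144, 8))
```

Under these assumptions, the number of frames processed in one encode call is bounded by the slice length rather than by the full video length, which is the property that lets memory use stay flat as videos get longer.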