Improved Video VAE for Latent Video Diffusion Model

Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, Zheng-Jun Zha
2024-11-10
Abstract: A Variational Autoencoder (VAE) aims to compress pixel data into a low-dimensional latent space, playing an important role in OpenAI's Sora and other latent video diffusion generation models. While most existing video VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) Initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches: one half completely inherits the compression prior of keyframes from a lower-dimension image VAE, while the other half performs temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence of the video VAE. The GCConv in the 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable-frame video. Extensive experiments on five benchmarks demonstrate the SOTA video reconstruction and generation capabilities of the proposed IV-VAE (https://wpy1999.github.io/IV-VAE/).
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
This paper attempts to solve the following problems:

1. **Initialization problems of existing video VAEs**:
   - Existing video variational autoencoders (VAEs) are usually initialized from pretrained image VAEs, which restricts their subsequent temporal compression ability. Specifically, a well-trained image VAE is not suited to an additional 4-fold temporal compression: the initial spatial compression performance declines sharply and temporal compression improves only slowly.
   - The paper points out that initializing with an image VAE of the same latent dimension inhibits the improvement of the subsequent temporal compression ability.
2. **Inter-frame imbalance problems caused by causal convolution**:
   - Existing 3D causal convolution methods lead to unequal information interaction and performance imbalance between frames. Because information flows in only one direction in causal convolution, the earlier frames in a frame group cannot draw on later frames, while the last frame can access the information of the entire group, so optimization is biased towards the later frames.
   - This imbalance leads to frequent flickering artifacts in the reconstructed video, visible as large fluctuations in SSIM values between frames.
3. **Temporal motion perception problems in high-resolution videos**:
   - For high-resolution videos, the same motion spans more pixels, making it difficult for the model to capture the motion pattern. Existing VAEs have insufficient temporal motion perception ability at high resolutions, resulting in a decline in reconstruction quality.
4. **Lack of appropriate evaluation benchmarks**:
   - In video generation, the requirements for content diversity and resolution are increasing, but existing open-source datasets (such as Kinetics-600 and UCF-101) have a low resolution (720P or lower) and cannot be used for evaluation at higher resolutions (such as 1080P).
   - In addition, high-resolution datasets (such as Panda-70M) mainly contain slow-motion and fixed-shot videos, so they do not reflect the overall performance of video VAEs at different motion speeds.

To solve these problems, the paper proposes the following improvements:

1. **Keyframe-based temporal compression architecture (KTC)**:
   - A two-branch architecture is proposed, in which one branch inherits the prior knowledge of a low-dimensional image VAE and focuses on keyframe compression, while the other branch performs temporal compression through 3D group causal convolution, reducing spatio-temporal conflicts and accelerating convergence.
2. **Group causal convolution (GCConv)**:
   - The input frames are grouped according to the temporal compression rate, and causal logical padding is applied between groups so that variable-frame videos can still be processed consistently. Standard convolution is used within each group, so that the frames in a group share equivalent interaction information and reconstruction is smoother.
3. **Temporal motion perception enhancement (TMPE)**:
   - Dilated convolution and multiple attention modules are introduced to expand the receptive field and improve the model's temporal motion perception ability at high resolutions.
4. **A new evaluation benchmark, MotionHD**:
   - A subset of 2,000 1080P videos covering various motion speeds is collected as a supplementary evaluation benchmark to comprehensively evaluate the reconstruction ability of video VAEs.

Through these improvements, the paper demonstrates superior performance on multiple benchmarks, especially in video reconstruction and generation at high resolutions and different motion speeds.
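
The difference between causal and group causal convolution described above can be illustrated by listing which input frames a single temporal convolution layer lets each output frame see. The following is a minimal sketch, not the authors' implementation: the function names, the kernel size of 3, the group size of 4 (standing in for the temporal compression rate), and the exact clipping rule at group boundaries are all assumptions made for illustration, and the special handling of a leading keyframe is ignored.

```python
def causal_window(t, kernel=3):
    """Frames visible to output frame t under a causal temporal convolution.

    All padding is on the past side, so frame t only sees frames <= t.
    """
    return list(range(max(0, t - kernel + 1), t + 1))


def group_causal_window(t, num_frames, kernel=3, group=4):
    """Frames visible to output frame t under an ASSUMED group-causal rule.

    Inside a group the window is symmetric (standard convolution), so a
    frame can see both earlier and later frames of its own group. The
    window may reach back into the previous group, but it is clipped at
    the end of the current group so no future group is ever visible.
    """
    half = kernel // 2
    # Index of the last frame belonging to the same group as frame t.
    group_end = min((t // group + 1) * group, num_frames) - 1
    lo = max(0, t - half)          # may cross into the previous group
    hi = min(group_end, t + half)  # never crosses into a later group
    return list(range(lo, hi + 1))


if __name__ == "__main__":
    num_frames = 8
    for t in range(num_frames):
        print(t, causal_window(t), group_causal_window(t, num_frames))
```

Under these assumptions, a causal layer never lets a frame see anything after itself, whereas the group-causal layer lets frames inside a group exchange information in both directions while still keeping later groups hidden from earlier ones. That second property is what allows a video to be processed group by group regardless of its total length.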