Abstract: Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. In the latter case, temporal compression is realized simply by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering compatibility with existing T2I models results in a latent space gap between them, and bridging that gap takes huge computational resources even with the T2I models as initialization. To address this issue, we propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., the image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which involves formulating a regularization loss using the image VAE. Benefiting from the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.
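To make the contrast between uniform frame sampling and true temporal compression concrete, here is a minimal toy sketch. It assumes a 4x temporal factor (suggested by the "four times more frames" claim) and uses plain window averaging as a deliberately crude stand-in for a learned 3D VAE encoder; the function names and the toy video are hypothetical and not taken from the paper.

```python
from typing import List

Frame = List[float]  # toy "frame": a flat list of pixel values


def uniform_sample(video: List[Frame], stride: int) -> List[Frame]:
    """Keep every `stride`-th frame; the frames in between are discarded."""
    return video[::stride]


def window_average(video: List[Frame], stride: int) -> List[Frame]:
    """Crude stand-in (an assumption) for learned temporal compression:
    every output step is computed from all `stride` frames in its window."""
    out = []
    for start in range(0, len(video) - stride + 1, stride):
        window = video[start:start + stride]
        # Average each pixel position across the frames in the window.
        out.append([sum(px) / stride for px in zip(*window)])
    return out


# A 16-frame toy video in which a single pixel brightens linearly over time.
video = [[float(t)] for t in range(16)]

sampled = uniform_sample(video, 4)     # only frames 0, 4, 8, 12 contribute
compressed = window_average(video, 4)  # all 16 frames contribute
```

Both outputs have four temporal steps, but only the second one carries information from every input frame, which is the property a spatio-temporally compressed latent space is meant to preserve.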

What problem does this paper attempt to address?

The main problem this paper addresses is that current video generation models lack a continuous 3D variational autoencoder (VAE) that is compatible with existing pre-trained image and video models. Specifically:

1. **Compatibility issue**: Existing diffusion models usually use pre-trained text-to-image (T2I) models as initialization. Directly training a video VAE without considering compatibility with these T2I models leads to an inconsistent latent space, and bridging that gap requires a large amount of computing resources. For example, Figure 2 shows that when the video VAE is trained independently, latents sampled with the Stable Video Diffusion (SVD) model cannot be correctly projected to pixel space because of this inconsistency.
2. **Spatio-temporal compression issue**: Current video generation models usually compress the time dimension by uniformly sampling frames, which results in unsmooth motion between adjacent frames. The paper proposes a method that achieves true spatio-temporal compression by introducing 3D convolution, thereby generating smoother videos at a higher frame rate.
3. **Latent space alignment issue**: To keep the latent space of the video VAE aligned with that of an existing image VAE (such as the one in Stable Diffusion), the paper proposes a latent space regularization method. Introducing a regularization loss during training avoids a shift of the latent distribution, so the latent space of the video VAE remains consistent with that of the image VAE.

The main contributions of the paper include:

- Proposing a video VAE (CV-VAE) that is compatible with existing pre-trained image and video models and performs true spatio-temporal compression.
- Designing a latent space regularization method that aligns the latent space of the video VAE with that of the image VAE.
- Verifying the effectiveness of the proposed CV-VAE through experiments and demonstrating its superior performance in video generation.

Through these improvements, CV-VAE can not only generate higher-quality videos, but also be used to fine-tune existing models, improving the efficiency and quality of video generation.
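The idea of adding a regularization term to the training objective can be sketched as follows. This is only an illustration under stated assumptions: the summary says the loss is formulated using the image VAE but does not give its form, so the mean-squared distance between video-VAE latents and frozen image-VAE latents, the helper names, and the `reg_weight` parameter are all hypothetical. The paper's actual regularizer may, for instance, compare decoded pixels rather than latents.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    assert len(a) == len(b), "inputs must have the same length"
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(
    recon: Sequence[float],         # video reconstructed by the video VAE
    target: Sequence[float],        # ground-truth video pixels
    video_latent: Sequence[float],  # latent produced by the video VAE encoder
    image_latent: Sequence[float],  # latent from the frozen image VAE (assumed reference)
    reg_weight: float = 1.0,        # hypothetical weighting factor
) -> float:
    """Reconstruction loss plus a term that penalizes latent drift
    away from the image VAE's latent space (assumed form)."""
    reconstruction = mse(recon, target)
    regularization = mse(video_latent, image_latent)
    return reconstruction + reg_weight * regularization


# Toy numbers: reconstruction error 1.0, latent mismatch 2.0, weight 0.5.
example = total_loss([0.0], [1.0], [1.0, 2.0], [1.0, 4.0], reg_weight=0.5)
```

With `reg_weight` set to zero the objective reduces to plain reconstruction, which corresponds to the independently trained video VAE that the summary describes as incompatible with existing models.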

CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Improved Video VAE for Latent Video Diffusion Model

OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Video Probabilistic Diffusion Models in Projected Latent Space

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Variational Diffusion Auto-encoder: Latent Space Extraction from Pre-trained Diffusion Models

VEnhancer: Generative Space-Time Enhancement for Video Generation

Predicting Video with VQVAE

REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Video Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models