OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinghua Cheng, Li Yuan

2024-09-02

Abstract: Variational Autoencoder (VAE), compressing videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). With the same reconstruction quality, the more sufficient the VAE's compression for videos is, the more efficient the LVDMs are. However, most LVDMs utilize 2D image VAE, whose compression for videos is only in the spatial dimension and often ignored in the temporal dimension. How to conduct temporal compression for videos in a VAE to obtain more concise latent representations while promising accurate reconstruction is seldom explored. To fill this gap, we propose an omni-dimension compression VAE, named OD-VAE, which can temporally and spatially compress videos. Although OD-VAE's more sufficient compression brings a great challenge to video reconstruction, it can still achieve high reconstructed accuracy by our fine design. To obtain a better trade-off between video reconstruction quality and compression speed, four variants of OD-VAE are introduced and analyzed. In addition, a novel tail initialization is designed to train OD-VAE more efficiently, and a novel inference strategy is proposed to enable OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of our proposed methods.

Computer Vision and Pattern Recognition, Image and Video Processing

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses a limitation of existing Latent Video Diffusion Models (LVDMs): the Variational Autoencoder (VAE) that compresses videos does not exploit the temporal dimension. Most existing LVDMs use 2D VAEs designed for images, which compress each frame spatially and leave the temporal redundancy between frames untouched. The resulting latent representations are not compact enough, which enlarges the input to the downstream denoiser and raises hardware cost.

To solve this problem, the authors propose an Omni-dimensional Variational Autoencoder (OD-VAE) that compresses videos in both the temporal and spatial dimensions, yielding more compact latent representations. Although this stronger compression makes video reconstruction harder, OD-VAE still achieves accurate reconstruction through a carefully designed architecture.

The paper also introduces four variants of OD-VAE to balance video reconstruction quality against compression speed, a tail initialization method to make training more efficient, and a temporal tiling inference strategy that lets OD-VAE handle videos of arbitrary length with limited GPU memory.

Overall, the main goal of the paper is to improve the efficiency and performance of LVDMs through better video compression, especially in resource-constrained scenarios.
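To make the compactness argument concrete, the following minimal sketch compares latent sizes for a spatial-only VAE and a VAE that also compresses time. The compression factors (8x spatial, 4x temporal), the 4 latent channels, the clip size, and the helper name `latent_shape` are illustrative assumptions, not values taken from the text above.

```python
from math import ceil, prod


def latent_shape(frames, height, width, t_factor, s_factor, channels=4):
    """Return the latent shape (C, T', H', W') for a video clip.

    All factors and the channel count are hypothetical, for illustration only.
    """
    return (
        channels,
        ceil(frames / t_factor),  # temporal downsampling
        height // s_factor,       # spatial downsampling (height)
        width // s_factor,        # spatial downsampling (width)
    )


# Hypothetical clip: 64 frames at 256x256.
video = (64, 256, 256)

# 2D image VAE applied frame by frame: spatial compression only.
spatial_only = latent_shape(*video, t_factor=1, s_factor=8)

# VAE that also compresses along time (assumed 4x temporal factor).
omni = latent_shape(*video, t_factor=4, s_factor=8)

print("spatial-only latent:", spatial_only, "elements:", prod(spatial_only))
print("temporal+spatial latent:", omni, "elements:", prod(omni))
print("reduction in denoiser input size:", prod(spatial_only) / prod(omni))
```

Under these assumed factors, the denoiser input shrinks by exactly the temporal factor, which is the source of the efficiency gain the summary describes.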

Improved Video VAE for Latent Video Diffusion Model

CV-VAE: A Compatible Video VAE for Latent Generative Video Models

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

OVQE: Omniscient Network for Compressed Video Quality Enhancement

DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents

High Efficiency Deep-learning Based Video Compression

REDUCIO! Generating 1024×1024 Video Within 16 Seconds Using Extremely Compressed Motion Latents

Spatial-Temporal Transformer based Video Compression Framework

VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis

Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition

Predicting Video with VQVAE

High Efficiency Image Compression for Large Visual-Language Models

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis