VEnhancer: Generative Space-Time Enhancement for Video Generation

Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu

2024-07-10

Abstract: We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding more details in the spatial domain and synthesizing detailed motion in the temporal domain. Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously with arbitrary up-sampling space and time scales through a unified video diffusion model. Furthermore, VEnhancer effectively removes the spatial artifacts and temporal flickering of generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low-frame-rate and low-resolution videos. To train this video ControlNet effectively, we design space-time data augmentation as well as video-aware conditioning. Thanks to these designs, VEnhancer is stable during training and can be trained end to end. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method VideoCrafter-2 reaches first place on the video generation benchmark VBench.

Computer Vision and Pattern Recognition, Image and Video Processing

What problem does this paper attempt to address?

This paper addresses how to enhance the output of text-to-video generation effectively and flexibly: adding detail at higher spatial and temporal resolution while eliminating artifacts and flickering in the generated videos. VEnhancer is a generative space-time enhancement framework that improves existing text-to-video results by adding more details in the spatial domain and synthesizing detailed motion in the temporal domain. Given a low-quality video, it increases spatial and temporal resolution simultaneously and supports arbitrary upsampling ratios in both space and time. Because a single unified video diffusion model performs the enhancement, VEnhancer can also remove spatial artifacts and temporal flickering from the generated videos. To achieve this, the researchers train a video ControlNet on top of a pretrained video diffusion model and inject it into the diffusion model as a condition on low-frame-rate, low-resolution videos. They also design space-time data augmentation and video-aware conditioning so that this video ControlNet can be trained effectively. VEnhancer overcomes several limitations of existing methods: the need for independently trained, cascaded spatial and temporal super-resolution models, support for only fixed upsampling factors, and difficulty balancing quality and fidelity during video enhancement. With these designs, VEnhancer surpasses current state-of-the-art methods in enhancing AI-generated videos and enables the leading open-source text-to-video method, VideoCrafter-2, to reach first place on the VBench video generation benchmark.
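
As a rough illustration of the space-time data augmentation idea, the sketch below builds one (low-quality condition, high-quality target) training pair by degrading a clean clip in both time and space with randomly chosen factors. It is not the authors' implementation: the function name `make_training_pair`, the factor sets `space_scales` and `time_skips`, and the use of plain strided subsampling are all assumptions made for illustration, and the abstract does not specify the actual degradation pipeline.

```python
import random


def make_training_pair(video, space_scales=(1, 2, 4), time_skips=(1, 2, 4), rng=None):
    """Return (low_quality_condition, target, space_scale, time_skip).

    `video` is a list of frames; each frame is a list of rows; each row is a
    list of pixel values. The factor sets are hypothetical, not from the paper.
    """
    rng = rng or random.Random()
    s = rng.choice(space_scales)  # spatial downsampling factor for this sample
    t = rng.choice(time_skips)    # temporal frame-skip factor for this sample

    # Temporal degradation: keep every t-th frame (lower frame rate).
    key_frames = video[::t]

    # Spatial degradation: keep every s-th pixel along height and width.
    # Strided subsampling stands in for whatever resampling is really used.
    low = [[row[::s] for row in frame[::s]] for frame in key_frames]

    # The original clip is the reconstruction target for the enhancer.
    return low, video, s, t


# Usage: an 8-frame, 4x4 toy clip with fixed factors so the result is deterministic.
video = [[[f * 100 + y * 10 + x for x in range(4)] for y in range(4)] for f in range(8)]
low, target, s, t = make_training_pair(video, space_scales=(2,), time_skips=(4,))
print(len(low), len(low[0]), len(low[0][0]))  # frames, height, width of the condition
```

Sampling the factors per training example, rather than fixing them, is one plausible way a single model could learn to handle arbitrary upsampling scales at inference time.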

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

MagicVideo: Efficient Video Generation With Latent Diffusion Models

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

I4VGen: Image as Free Stepping Stone for Text-to-Video Generation

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Imagen Video: High Definition Video Generation with Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution