MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation

Weimin Wang, Jiawei Liu, Zhijie Lin, Jiangqiao Yan, Shuo Chen, Chetwin Low, Tuyen Hoang, Jie Wu, Jun Hao Liew, Hanshu Yan, Daquan Zhou, Jiashi Feng

2024-01-09

Abstract: The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2 that integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate an aesthetically pleasing, high-resolution video with remarkable fidelity and smoothness. It demonstrates superior performance over leading Text-to-Video systems such as Runway, Pika 1.0, Morph, Moon Valley and Stable Video Diffusion model via user evaluation at large scale.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The problem this paper attempts to address is generating high-quality, high-fidelity videos from textual descriptions. Specifically, the authors propose a multi-stage video generation framework named MagicVideo-V2, which aims to generate aesthetically pleasing, high-resolution, and smooth videos by integrating Text-to-Image (T2I), Image-to-Video (I2V), Video-to-Video (V2V), and Video Frame Interpolation (VFI) modules.

### Main Issues:

1. **High-Fidelity Video Generation**: Current Text-to-Video (T2V) models fall short in generating high-fidelity videos, especially in maintaining the aesthetic quality and smoothness of the video.
2. **Multi-Module Integration**: How to effectively integrate multiple different generation modules (such as T2I, I2V, V2V, and VFI) into an end-to-end video generation pipeline to improve the quality and performance of the generated videos.
3. **User Demand**: With the growing demand for high-quality video generation, how to meet users' diverse and high-quality video content requirements.

### Solution:

- **Text-to-Image Module (T2I)**: Generates high-quality reference images from text prompts, capturing the aesthetic essence of the input.
- **Image-to-Video Module (I2V)**: Uses the generated images as conditions to produce low-resolution keyframes.
- **Video-to-Video Module (V2V)**: Performs super-resolution on the keyframes, enhancing details to generate high-resolution videos.
- **Video Frame Interpolation Module (VFI)**: Inserts frames between keyframes to make video motion smoother.

Through the collaborative work of these modules, MagicVideo-V2 can generate aesthetically pleasing, high-resolution, and smooth videos, and has demonstrated superior performance over existing leading T2V systems in large-scale user evaluations.
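The T2I, I2V, V2V, VFI flow described above can be illustrated with a minimal, shape-only sketch. It is not the authors' implementation: no model is run, and every name here (`Clip`, `text_to_image`, `image_to_video`, `video_to_video`, `interpolate_frames`, `generate`) and every number (the 1024-pixel reference size, 8 keyframes, the 4x scale factors, 2 inserted frames per gap) is a hypothetical placeholder chosen only to show how the output of each stage feeds the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Shape-only stand-in for an image (frames == 1) or a video."""
    frames: int
    height: int
    width: int


def text_to_image(prompt: str, size: int = 1024) -> Clip:
    # T2I: one reference image from the prompt (size is an illustrative value).
    return Clip(frames=1, height=size, width=size)


def image_to_video(reference: Clip, num_keyframes: int = 8,
                   downscale: int = 4) -> Clip:
    # I2V: low-resolution keyframes conditioned on the reference image.
    return Clip(num_keyframes,
                reference.height // downscale,
                reference.width // downscale)


def video_to_video(keyframes: Clip, upscale: int = 4) -> Clip:
    # V2V: super-resolution on the keyframes; the frame count is unchanged.
    return Clip(keyframes.frames,
                keyframes.height * upscale,
                keyframes.width * upscale)


def interpolate_frames(video: Clip, inserted_per_gap: int = 2) -> Clip:
    # VFI: insert new frames into each of the (frames - 1) gaps between keyframes.
    gaps = max(video.frames - 1, 0)
    return Clip(video.frames + gaps * inserted_per_gap,
                video.height, video.width)


def generate(prompt: str) -> Clip:
    # Chain the four stages in the order the summary lists them.
    reference = text_to_image(prompt)
    keyframes = image_to_video(reference)
    high_res = video_to_video(keyframes)
    return interpolate_frames(high_res)


if __name__ == "__main__":
    print(generate("a cat surfing at sunset"))
    # prints: Clip(frames=22, height=1024, width=1024)
```

With these placeholder settings, 8 keyframes have 7 gaps, so inserting 2 frames per gap yields 8 + 7 × 2 = 22 frames, while the resolution goes from 256 × 256 after I2V back up to 1024 × 1024 after V2V.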

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Magic-Me: Identity-Specific Video Customized Diffusion

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

MagicAvatar: Multimodal Avatar Generation and Animation

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

Imagen Video: High Definition Video Generation with Diffusion Models

FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

VEnhancer: Generative Space-Time Enhancement for Video Generation

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

MoVideo: Motion-Aware Video Generation with Diffusion Models

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation