MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation

Weimin Wang, Jiawei Liu, Zhijie Lin, Jiangqiao Yan, Shuo Chen, Chetwin Low, Tuyen Hoang, Jie Wu, Jun Hao Liew, Hanshu Yan, Daquan Zhou, Jiashi Feng
2024-01-09
Abstract: The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2 that integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate an aesthetically pleasing, high-resolution video with remarkable fidelity and smoothness. It demonstrates superior performance over leading Text-to-Video systems such as Runway, Pika 1.0, Morph, Moon Valley and Stable Video Diffusion model via user evaluation at large scale.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to address is generating high-quality, high-fidelity videos from textual descriptions. Specifically, the authors propose a multi-stage video generation framework named MagicVideo-V2, which aims to generate aesthetically pleasing, high-resolution, and smooth videos by integrating Text-to-Image (T2I), Image-to-Video (I2V), Video-to-Video (V2V), and Video Frame Interpolation (VFI) modules.

### Main Issues:

1. **High-Fidelity Video Generation**: Current Text-to-Video (T2V) models fall short in generating high-fidelity videos, especially in maintaining the aesthetic quality and smoothness of the video.
2. **Multi-Module Integration**: How to effectively integrate multiple different generation modules (such as T2I, I2V, V2V, and VFI) into an end-to-end video generation pipeline to improve the quality and performance of the generated videos.
3. **User Demand**: With the growing demand for high-quality video generation, how to meet users' diverse and high-quality video content requirements.

### Solution:

- **Text-to-Image Module (T2I)**: Generates high-quality reference images from text prompts, capturing the aesthetic essence of the input.
- **Image-to-Video Module (I2V)**: Uses the generated images as conditions to produce low-resolution keyframes.
- **Video-to-Video Module (V2V)**: Performs super-resolution on the keyframes, enhancing details to generate high-resolution videos.
- **Video Frame Interpolation Module (VFI)**: Inserts frames between keyframes to make video motion smoother.

Through the collaborative work of these modules, MagicVideo-V2 can generate aesthetically pleasing, high-resolution, and smooth videos, and has demonstrated superior performance over existing leading T2V systems in large-scale user evaluations.
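
The four-stage flow described above (T2I, then I2V, then V2V, then VFI) can be pictured as a plain function composition. The sketch below is only a toy illustration of that data flow, not the authors' implementation: the real modules are learned generative models, whereas here every stage is a hypothetical stand-in (`t2i`, `i2v`, `v2v`, `vfi`, and the `Video` container are names invented for this example), frames are tiny grayscale grids, super-resolution is replaced by nearest-neighbour upscaling, and interpolation by linear blending. All sizes and frame counts are arbitrary assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A toy grayscale frame: rows of pixel intensities (assumption for illustration).
Frame = List[List[float]]


@dataclass
class Video:
    """Hypothetical container for a sequence of equally sized frames."""
    frames: List[Frame]

    @property
    def size(self) -> Tuple[int, int]:
        # (height, width) of the first frame
        return (len(self.frames[0]), len(self.frames[0][0]))


def t2i(prompt: str, size: int = 2) -> Frame:
    """Stand-in for T2I: a constant reference image derived from the prompt."""
    value = (len(prompt) % 10) / 10.0
    return [[value] * size for _ in range(size)]


def i2v(reference: Frame, num_keyframes: int = 3) -> Video:
    """Stand-in for I2V: low-resolution keyframes conditioned on the reference.

    Each keyframe is the reference image brightened a little more,
    which fakes 'motion' over time.
    """
    frames = [
        [[px + 0.1 * t for px in row] for row in reference]
        for t in range(num_keyframes)
    ]
    return Video(frames)


def v2v(video: Video, scale: int = 2) -> Video:
    """Stand-in for V2V super-resolution: nearest-neighbour upscaling."""
    def upscale(frame: Frame) -> Frame:
        out: Frame = []
        for row in frame:
            wide = [px for px in row for _ in range(scale)]  # repeat columns
            out.extend(list(wide) for _ in range(scale))     # repeat rows
        return out

    return Video([upscale(f) for f in video.frames])


def vfi(video: Video) -> Video:
    """Stand-in for VFI: insert one linearly blended frame between neighbours."""
    out: List[Frame] = []
    for a, b in zip(video.frames, video.frames[1:]):
        out.append(a)
        out.append([[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)])
    out.append(video.frames[-1])
    return Video(out)


def generate(prompt: str) -> Video:
    """Chain the stages in the order the summary lists them."""
    return vfi(v2v(i2v(t2i(prompt))))


if __name__ == "__main__":
    video = generate("a cat")
    print(len(video.frames), video.size)
```

With the defaults assumed here, three keyframes become five frames after interpolation, and each 2x2 frame becomes 4x4 after upscaling. The point of the sketch is only the ordering: the reference image conditions keyframe generation, resolution is raised before interpolation, and interpolation runs last so that the inserted frames are already at the final resolution.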