Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong
2023-06-02
Abstract: Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image datasets solely into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which mitigates the potential quality degradation effectively. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address the following key issues:

### Research Background and Objectives

- **Challenges in Video Creation**: The current video creation process often requires professional knowledge in computer graphics, modeling, and animation production, making it difficult for non-professionals to easily transform their ideas into high-quality video content.
- **Limitations of Existing Technologies**: Although there have been advancements in Text-to-Video (T2V) synthesis technology, relying solely on text prompts may not be sufficient to precisely control video generation.

### Problems Addressed

- **Customized Video Generation**: By combining text descriptions and motion structure guidance (such as frame-level depth maps), explore a method that can generate customized videos that align with user intentions.
- **Improving Video Quality and Coherence**: Enhance video quality and temporal coherence while maintaining low computational resource requirements.
- **Expanding Application Scenarios**: Develop video generation technology that can be applied to various scenarios, including real-life scenes to video, dynamic 3D scene modeling to video, and video re-rendering.

### Technical Solution

- **Method Overview**: Propose a model named "Make-Your-Video," which is based on the Latent Diffusion Model (LDM). It uses text as scene descriptions and frame-level depth as specific guidance to generate videos.
- **Key Technical Points**:
  - Utilize a pre-trained image LDM and introduce temporal modules to meet the needs of video generation.
  - Propose a simple and effective causal attention mask strategy to support the synthesis of longer videos while mitigating potential quality degradation issues.

### Experimental Results

- **Performance**: Experimental results show that this method surpasses existing baseline methods in terms of video quality, especially temporal coherence and fidelity to user guidance.
- **Application Examples**: Demonstrate several interesting application cases, indicating the practical potential of this method, such as using simple physical models or manually constructed scenes to guide video generation.

### Main Contributions

1. **Efficient Customized Video Generation Method**: By introducing text and structural guidance, achieve controllable text-to-video generation with excellent qualitative and quantitative performance.
2. **Using Pre-trained Image LDM for Video Generation**: Propose a mechanism that utilizes pre-trained image LDM for video generation, inheriting rich visual concepts while achieving good temporal coherence.
3. **Long-duration Video Synthesis Technology**: Introduce a temporal mask mechanism that allows for the synthesis of longer videos while alleviating quality degradation issues.

In summary, the goal of this research is to develop a method that can efficiently generate high-quality, customized videos by combining text descriptions and structural guidance, and to validate its effectiveness and practicality through experiments.
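To make the causal attention mask listed under Key Technical Points more concrete, here is a minimal sketch of masked attention along the frame axis. It is an illustration under stated assumptions, not the paper's implementation: this summary does not specify the exact mask layout, so the sketch assumes the common lower-triangular form in which frame `i` attends only to frames `0..i`. The names `causal_mask` and `masked_temporal_attention` are hypothetical, and plain Python lists stand in for the latent feature tensors a real temporal module would use.

```python
import math


def causal_mask(num_frames):
    """Return a boolean matrix where mask[i][j] is True if frame i may attend to frame j.

    Assumption: a lower-triangular (j <= i) layout; the paper's exact mask may differ.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_temporal_attention(queries, keys, values, mask):
    """Scaled dot-product attention over frames, restricted by a boolean mask.

    queries, keys, values: one feature vector (list of floats) per frame.
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        # Skipping blocked frames is equivalent to setting their scores to -inf
        # before the softmax.
        allowed = [j for j in range(len(keys)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy usage: three frames with two-dimensional features (illustrative values only).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = masked_temporal_attention(frames, frames, frames, causal_mask(len(frames)))
```

Because each frame's output depends only on itself and earlier frames, changing or appending later frames leaves earlier outputs unchanged. That property is one plausible reason such a mask helps when synthesizing sequences longer than those seen during training.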