Abstract: Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image datasets solely into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which mitigates the potential quality degradation effectively. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage.

What problem does this paper attempt to address?

The paper aims to address the following key issues:

### Research Background and Objectives

- **Challenges in Video Creation**: The current video creation process often requires professional knowledge in computer graphics, modeling, and animation production, making it difficult for non-professionals to easily transform their ideas into high-quality video content.
- **Limitations of Existing Technologies**: Although there have been advancements in Text-to-Video (T2V) synthesis technology, relying solely on text prompts may not be sufficient to precisely control video generation.

### Problems Addressed

- **Customized Video Generation**: By combining text descriptions and motion structure guidance (such as frame-level depth maps), explore a method that can generate customized videos that align with user intentions.
- **Improving Video Quality and Coherence**: Enhance video quality and temporal coherence while maintaining low computational resource requirements.
- **Expanding Application Scenarios**: Develop video generation technology that can be applied to various scenarios, including real-life scenes to video, dynamic 3D scene modeling to video, and video re-rendering.

### Technical Solution

- **Method Overview**: Propose a model named "Make-Your-Video," which is based on the Latent Diffusion Model (LDM). It uses text as scene descriptions and frame-level depth as specific guidance to generate videos.
- **Key Technical Points**:
  - Utilize a pre-trained image LDM and introduce temporal modules to meet the needs of video generation.
  - Propose a simple and effective causal attention mask strategy to support the synthesis of longer videos while mitigating potential quality degradation issues.

### Experimental Results

- **Performance**: Experimental results show that this method surpasses existing baseline methods in terms of video quality, especially temporal coherence and fidelity to user guidance.
- **Application Examples**: Demonstrate several interesting application cases, indicating the practical potential of this method, such as using simple physical models or manually constructed scenes to guide video generation.

### Main Contributions

1. **Efficient Customized Video Generation Method**: By introducing text and structural guidance, achieve controllable text-to-video generation with excellent qualitative and quantitative performance.
2. **Using Pre-trained Image LDM for Video Generation**: Propose a mechanism that utilizes pre-trained image LDM for video generation, inheriting rich visual concepts while achieving good temporal coherence.
3. **Long-duration Video Synthesis Technology**: Introduce a temporal mask mechanism that allows for the synthesis of longer videos while alleviating quality degradation issues.

In summary, the goal of this research is to develop a method that can efficiently generate high-quality, customized videos by combining text descriptions and structural guidance, and to validate its effectiveness and practicality through experiments.
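The causal attention mask mentioned under the key technical points can be illustrated with a small sketch. The summary does not spell out the paper's exact mask design, so the code below assumes the standard lower-triangular form, in which each frame attends only to itself and earlier frames. The function names `causal_mask` and `masked_temporal_attention`, the tensor shapes, and the use of NumPy are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def causal_mask(num_frames: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if frame i may attend to frame j (j <= i).

    Assumption: a plain lower-triangular mask; the paper's variant may differ.
    """
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))


def masked_temporal_attention(q, k, v, mask):
    """Scaled dot-product attention along the frame axis with a boolean mask.

    q, k, v: arrays of shape (num_frames, dim) for a single spatial location.
    Returns the attended features and the attention weights.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed positions get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax; every row keeps its diagonal, so the max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical toy input: 4 frames, 8-dimensional features per frame.
rng = np.random.default_rng(0)
frames, dim = 4, 8
q, k, v = (rng.standard_normal((frames, dim)) for _ in range(3))
out, weights = masked_temporal_attention(q, k, v, causal_mask(frames))
```

Under this assumed mask, the first frame depends only on itself and no frame receives information from later frames, so the output for a given frame does not change with the total number of frames in the clip. That property is one plausible reason a causal mask would help when synthesizing clips longer than those seen in training.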

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

ControlVideo: Training-free Controllable Text-to-Video Generation

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Motion Prompting: Controlling Video Generation with Motion Trajectories

LivePhoto: Real Image Animation with Text-guided Motion Control

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Animate Your Motion: Turning Still Images into Dynamic Videos

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

VideoBooth: Diffusion-based Video Generation with Image Prompts

Sketch Me A Video

VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Motion Control for Enhanced Complex Action Video Generation

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Text-Animator: Controllable Visual Text Video Generation

VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation