VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet

Zhihao Hu, Dong Xu

2023-08-03

Abstract: Recently, diffusion models like StableDiffusion have achieved impressive image generation results. However, the generation process of such diffusion models is uncontrollable, which makes it hard to generate videos with continuous and consistent content. In this work, by using the diffusion model with ControlNet, we propose a new motion-guided video-to-video translation framework called VideoControlNet to generate various videos based on the given prompts and the condition from the input video. Inspired by video codecs that use motion information to reduce temporal redundancy, our framework uses motion information to avoid regenerating redundant areas, which keeps the content consistent. Specifically, we generate the first frame (i.e., the I-frame) by using the diffusion model with ControlNet. Then we generate the other key frames (i.e., the P-frames) based on the previous I/P-frame by using our newly proposed motion-guided P-frame generation (MgPG) method, in which the P-frames are generated based on the motion information and the occlusion areas are inpainted by using the diffusion model. Finally, the remaining frames (i.e., the B-frames) are generated by using our motion-guided B-frame interpolation (MgBI) module. Our experiments demonstrate that our proposed VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video diffusion model by using motion information. More results are provided on our project page.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issues of continuity and content consistency in video generation. Specifically, existing diffusion models (such as StableDiffusion) perform well in image generation but struggle to maintain continuity and consistency between frames in video generation. To solve this problem, the authors propose a new motion-guided video-to-video translation framework called VideoControlNet.

#### Main Contributions

1. **New Framework**: A new framework based on diffusion models and ControlNet, called VideoControlNet, is proposed. It generates continuous and content-consistent videos by utilizing motion information from the input video.
2. **Motion-Guided P-Frame Generation Module (MgPG)**: To generate P-frames, a new motion-guided P-frame generation module is proposed, which uses motion information from the input video to maintain content consistency and employs diffusion models to fill in newly appearing areas.
3. **Motion-Guided B-Frame Interpolation Module (MgBI)**: A motion-guided B-frame interpolation module is also proposed to generate the remaining B-frames based on reference I/P-frames.
4. **Experimental Validation**: Experimental results show that the proposed method inherits the generative capabilities of pre-trained large diffusion models and extends image diffusion models to video diffusion models, thereby generating high-quality and continuous content.

Through these improvements, VideoControlNet can generate videos with different styles or content given different text prompts, while maintaining good continuity and consistency.
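The I/P/B pipeline summarized above can be illustrated with a small toy sketch. It is not the authors' implementation. The key-frame interval, the integer per-pixel flow format, the rule that marks a pixel as occluded when its source falls outside the previous frame, and the `inpaint_fn` callback (a stand-in for the diffusion-based inpainting step) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Frame = List[List[float]]            # toy grayscale frame: rows of pixel values
Flow = List[List[Tuple[int, int]]]   # assumed format: per-pixel (dx, dy) into the previous frame
Mask = List[List[bool]]              # True where content must be newly generated


def assign_frame_types(num_frames: int, key_interval: int = 4) -> List[str]:
    """Label frame 0 as I, every key_interval-th frame as P, the rest as B.

    The last frame is forced to be a key frame so every B-frame sits between
    two reference frames. key_interval is a hypothetical parameter.
    """
    types = []
    for i in range(num_frames):
        if i == 0:
            types.append("I")
        elif i % key_interval == 0 or i == num_frames - 1:
            types.append("P")
        else:
            types.append("B")
    return types


def warp_with_occlusion(prev: Frame, flow: Flow) -> Tuple[Frame, Mask]:
    """Backward-warp the previous key frame and mark pixels with no valid source."""
    h, w = len(prev), len(prev[0])
    warped = [[0.0] * w for _ in range(h)]
    occluded = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            if 0 <= sx < w and 0 <= sy < h:
                warped[y][x] = prev[sy][sx]   # reuse already generated content
            else:
                occluded[y][x] = True         # newly appearing area (assumed rule)
    return warped, occluded


def generate_p_frame(prev: Frame, flow: Flow,
                     inpaint_fn: Callable[[Frame, Mask], Frame]) -> Frame:
    """Warp the previous I/P-frame, then let inpaint_fn fill only the masked pixels."""
    warped, occluded = warp_with_occlusion(prev, flow)
    return inpaint_fn(warped, occluded)


if __name__ == "__main__":
    prev = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
    flow = [[(-1, 0)] * 3 for _ in range(2)]  # content shifts one pixel to the right

    def fill(frame: Frame, mask: Mask) -> Frame:
        # Dummy inpainter: writes -1.0 where a diffusion model would generate content.
        return [[-1.0 if m else v for v, m in zip(fr, mr)]
                for fr, mr in zip(frame, mask)]

    print(assign_frame_types(9))
    print(generate_p_frame(prev, flow, fill))
```

In this sketch only the masked pixels are handed to the generator, which mirrors the idea of not regenerating redundant areas. A real system would estimate dense optical flow from the input video and use sub-pixel interpolation rather than integer offsets.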
