Abstract: Customized video generation aims to generate high-quality videos guided by text prompts and reference images of a subject. However, because subject learning is fine-tuned only on static images, it disrupts the ability of video diffusion models (VDMs) to combine concepts and generate motion. To restore these abilities, some methods use additional videos similar to the prompt to fine-tune or guide the model. This requires changing the guiding video frequently, and even re-tuning the model, when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and concept combination abilities without additional videos or fine-tuning for recovery. To preserve the concept combination ability, we design a plug-and-play module that updates a small number of parameters in VDMs, enhancing the model's ability to capture appearance details and to combine concepts for new subjects. For motion generation, we observe that VDMs tend to restore the motion of the video in the early stage of denoising, while focusing on recovering subject details in the later stage. Therefore, we propose the Dynamic Weighted Video Sampling Strategy. Exploiting the pluggability of our subject learning module, we reduce its impact on motion generation in the early stage of denoising, preserving the VDM's ability to generate motion. In the later stage of denoising, we restore the module to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject's appearance. Experimental results show that our method significantly improves on previous methods.
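The two-stage idea behind the Dynamic Weighted Video Sampling Strategy can be illustrated with a minimal sketch. The abstract does not give the switch point or the weight values, so the function name `module_weight`, the parameter `switch_step`, and the weights 0.0 and 1.0 are hypothetical choices made only for illustration. This is not the authors' implementation.

```python
def module_weight(step, total_steps, switch_step, early_weight=0.0, late_weight=1.0):
    """Return the scale applied to the subject-learning module at one denoising step.

    `step` counts from 0 (noisiest latent) to total_steps - 1 (cleanest latent).
    All parameter names and default values are illustrative assumptions.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    # Early steps: damp the module so the base model's motion prior dominates.
    if step < switch_step:
        return early_weight
    # Later steps: restore the module so subject appearance details are repaired.
    return late_weight


# Example: 10 denoising steps, with the module damped for the first 3.
schedule = [module_weight(s, total_steps=10, switch_step=3) for s in range(10)]
```

A hard switch is the simplest schedule that matches the description. A gradual ramp between the two weights would be an equally plausible reading, since the abstract only says that the module's impact is reduced early and restored later.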

What problem does this paper attempt to address?

The paper aims to address three major challenges in customized video generation: consistent subject appearance, free combination of concepts, and smooth motion generation. Specifically, existing methods often compromise the concept combination and motion generation abilities of video diffusion models (VDMs) when fine-tuning on static images. To tackle this issue, some methods introduce additional videos similar to the prompt content to guide or fine-tune the model. However, this approach requires frequently replacing the guiding videos or even re-fine-tuning the model, which is inconvenient for users. The paper proposes a new framework called CustomCrafter, which retains the model's concept combination and motion generation abilities without additional videos and without fine-tuning to recover them. It combines a plug-in module that updates a small number of parameters in VDMs with a Dynamic Weighted Video Sampling Strategy, which adjusts the weight of the subject learning module across the stages of denoising: the module's influence is reduced in the early stage, where motion is formed, and restored in the later stage, where subject details are recovered. This ensures both natural motion generation and consistency in subject details. According to the reported experiments, CustomCrafter significantly outperforms existing methods in concept combination ability, motion smoothness, and subject appearance consistency, without needing additional videos as guidance. This allows users to conveniently generate high-quality videos that match specific prompts.
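What makes a plug-in module pluggable can be shown with a toy layer whose learned update is added to frozen base weights through an adjustable scale. The class name `PluggableLinear`, the additive form of the update, and the numbers below are assumptions for illustration. The summary above does not specify the module's architecture, and a real update would be much smaller than the base weights rather than a full matrix.

```python
class PluggableLinear:
    """Toy linear layer: frozen base weights plus a scaled additive update.

    Hypothetical illustration only, not the module described in the paper.
    """

    def __init__(self, base_weight, delta_weight, scale=1.0):
        self.base_weight = base_weight    # frozen weights of the pretrained model
        self.delta_weight = delta_weight  # update learned from the subject images
        self.scale = scale                # 0.0 unplugs the module, 1.0 applies it fully

    def forward(self, x):
        # Compute (W + scale * dW) @ x, one output row at a time.
        out = []
        for base_row, delta_row in zip(self.base_weight, self.delta_weight):
            out.append(sum((b + self.scale * d) * xi
                           for b, d, xi in zip(base_row, delta_row, x)))
        return out


layer = PluggableLinear(base_weight=[[1.0, 0.0], [0.0, 1.0]],
                        delta_weight=[[0.5, 0.0], [0.0, 0.5]])

layer.scale = 0.0                    # early denoising: behaves like the base model
base_out = layer.forward([2.0, 4.0])

layer.scale = 1.0                    # late denoising: subject update fully applied
tuned_out = layer.forward([2.0, 4.0])
```

Because the update enters only through `scale`, setting it to zero recovers the pretrained layer exactly. That property is what lets a sampling strategy lower or restore the module's influence at each step without retraining.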

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

MotionCrafter: One-Shot Motion Customization of Diffusion Models

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

MotionBooth: Motion-Aware Customized Text-to-Video Generation

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

VEnhancer: Generative Space-Time Enhancement for Video Generation