Abstract: Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In text-to-image (T2I) generation, subject-driven creation has achieved great progress, with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective framework for subject-identity-controllable video generation, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with a 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion, trained with the cropped identity to disentangle the ID information from the background; and 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD generates stable videos with better ID preservation than the baselines. Moreover, owing to the transferability of the encoded identity in the ID module, VCD also works well with publicly available personalized text-to-image models. The code is available at

What problem does this paper attempt to address?

This paper addresses the problem of generating high-quality videos while precisely controlling a specific identity (such as a particular person). Specifically:

1. **Customized Video for Specific Identities**: The paper proposes a framework called Video Custom Diffusion (VCD), which generates videos with stable motion and consistent identity features from a small number of reference images. Existing methods struggle to preserve identity and keep motion stable at the same time.
2. **Enhancement and Preservation of Identity Features**: Human video generation faces two main challenges: inconsistent motion caused by complex pose changes, and difficulty in capturing facial details. To tackle them, the paper introduces three components: a 3D Gaussian Noise Prior (to improve inter-frame stability), an ID module based on extended Textual Inversion (to disentangle the identity information from the background), and Face VCD and Tiled VCD modules (to reinforce facial details and upscale the video to a higher resolution).
3. **Stable Motion Generation**: The 3D Gaussian Noise Prior is training-free. It restores inter-frame correlation in the initial noise, which improves motion consistency across frames.
4. **Balancing Text Alignment and Identity Similarity**: The ID module combines extended Textual Inversion with a prompt-to-segmentation submodule, so that it is trained on the cropped identity rather than the whole image. This gives a better balance between text alignment and identity preservation.

In summary, the paper develops a method for generating high-quality videos of a specified identity, with a particular focus on customizing human identities.
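The idea of injecting frame-wise correlation into the initial noise can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `correlated_frame_noise`, the parameter `gamma`, and the exponentially decaying covariance `gamma ** |i - j|` between frames `i` and `j` are assumptions chosen for illustration, since the text above does not specify the covariance form.

```python
import numpy as np


def correlated_frame_noise(num_frames, latent_shape, gamma=0.5, seed=0):
    """Sample per-frame Gaussian noise that is correlated along the frame axis.

    Assumption (not from the source): the covariance between frames i and j
    is gamma ** |i - j|, with 0 <= gamma < 1. Each element keeps unit variance.
    """
    rng = np.random.default_rng(seed)

    # Frame-by-frame covariance matrix with exponentially decaying correlation.
    idx = np.arange(num_frames)
    cov = gamma ** np.abs(idx[:, None] - idx[None, :])

    # Cholesky factor L satisfies L @ L.T == cov.
    chol = np.linalg.cholesky(cov)

    # Independent noise per frame, flattened so frames can be mixed linearly.
    white = rng.standard_normal((num_frames, *latent_shape))
    flat = white.reshape(num_frames, -1)

    # Mixing independent frames with L yields the target frame covariance.
    return (chol @ flat).reshape(num_frames, *latent_shape)


if __name__ == "__main__":
    noise = correlated_frame_noise(8, (4, 32, 32), gamma=0.5)
    adjacent = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
    print(noise.shape, round(float(adjacent), 2))
```

With `gamma=0` the covariance is the identity matrix and the sketch reduces to ordinary independent per-frame noise. Larger values of `gamma` make neighbouring frames start from more similar noise, which is the intuition behind using a correlated prior for inter-frame stability.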

Magic-Me: Identity-Specific Video Customized Diffusion

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Ingredients: Blending Custom Photos with Video Diffusion Transformers

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models

StableIdentity: Inserting Anybody into Anywhere at First Sight

MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

I4VGen: Image as Free Stepping Stone for Text-to-Video Generation

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

ED-T2V: an Efficient Training Framework for Diffusion-based Text-to-Video Generation

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models