Magic-Me: Identity-Specific Video Customized Diffusion

Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, Jiashi Feng
2024-03-21
Abstract: Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In text-to-image (T2I) generation, subject-driven creation has achieved great progress, with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective framework for subject-identity-controllable video generation, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion, trained with the cropped identity to disentangle the ID information from the background; 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD is able to generate stable videos with better ID preservation than the baselines. Besides, thanks to the transferability of the encoded identity in the ID module, VCD also works well with publicly available personalized text-to-image models. The codes are available at
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem of generating high-quality videos while precisely controlling a specific identity (such as a particular character). Specifically:

1. **Customized Video for Specific Identities**: The paper proposes a new framework called Video Custom Diffusion (VCD), which aims to generate videos with stable motion and consistent identity features from a small number of reference images. This targets two things existing methods struggle with: preserving identity features and keeping motion stable across frames.
2. **Enhancement and Preservation of Identity Features**: To tackle the two main challenges in human video generation (inconsistent motion caused by complex pose changes, and difficulty in capturing facial details), the paper introduces three components: a 3D Gaussian Noise Prior (to improve inter-frame stability), an ID module based on extended Textual Inversion (to disentangle identity information from the background and strengthen identity features), and Face VCD and Tiled VCD modules (to enhance facial details and upscale the video to a higher resolution).
3. **Stable Motion Generation**: To generate videos with coherent motion, the paper introduces a training-free 3D Gaussian Noise Prior that restores correlation between frames at the noise initialization stage, thereby improving motion consistency.
4. **Balancing Text Alignment and Identity Similarity**: The paper also proposes an improved ID module that combines extended Textual Inversion with a prompt-to-segmentation submodule to achieve a better balance between text alignment and identity preservation.

In summary, this research aims to generate high-quality videos of specific identities from a few reference images, and it reports notable progress in customizing human identities in particular.
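
To make the idea of injecting frame-wise correlation at noise initialization more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the initial noise of all frames is drawn jointly from a Gaussian whose covariance between frames `i` and `j` is `gamma ** |i - j|`; the function name `correlated_frame_noise`, the parameter `gamma`, and this specific covariance form are illustrative assumptions, since the text above does not spell out the exact prior.

```python
import numpy as np

def correlated_frame_noise(num_frames, shape, gamma=0.5, seed=0):
    """Sample initial noise whose frames are correlated along the time axis.

    Assumption (not taken from the text above): the covariance between
    frame i and frame j is gamma ** |i - j|, with 0 <= gamma < 1.
    Every frame is still marginally standard normal, because the diagonal
    of the covariance matrix is 1.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1) for a valid covariance")
    rng = np.random.default_rng(seed)
    idx = np.arange(num_frames)
    # (num_frames, num_frames) covariance over the frame axis
    cov = gamma ** np.abs(idx[:, None] - idx[None, :])
    # Cholesky factor L with L @ L.T == cov
    chol = np.linalg.cholesky(cov)
    # Independent noise, one tensor per frame
    white = rng.standard_normal((num_frames, *shape))
    # Mix the independent frames along the frame axis only
    return np.tensordot(chol, white, axes=([1], [0]))

if __name__ == "__main__":
    # 16 frames of 4-channel 64x64 latents (sizes chosen for illustration)
    noise = correlated_frame_noise(16, (4, 64, 64), gamma=0.5)
    flat = noise.reshape(16, -1)
    print("shape:", noise.shape)
    print("corr(frame 0, frame 1):", np.corrcoef(flat[0], flat[1])[0, 1])
```

With `gamma = 0` the frames are independent, which corresponds to ordinary per-frame noise. Larger values tie neighboring frames more closely together, which is the intuition behind using correlated initial noise for smoother motion without any additional training.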