ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Jie Zhang
2024-06-26
Abstract: Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or usually missing identity details in the video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline that incorporates unified human attributes and action captioning techniques from a constructed facial image pool. Based on this pipeline, a random reference training strategy is further devised to precisely capture the ID-relevant embeddings with an ID-preserving loss, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator to generate personalized human videos over previous models. Moreover, our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models, showing high extendability in real-world applications for video generation where identity preservation is highly desired. Our code and checkpoints are released at https://github.com/ID-Animator/ID-Animator.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address several key challenges in generating high-fidelity, identity-specific human videos:

1. **High Training and Fine-tuning Costs**: Many existing identity-specific customization methods require cumbersome per-identity fine-tuning, which adds significant training overhead for every new identity at inference time and hinders the widespread application and scalability of these techniques.
2. **Lack of High-Quality Text-Conditioned Human Video Datasets**: Compared to the image generation community (e.g., LAION-Face), the video generation community lacks sufficient high-quality text-video data pairs, especially for human videos. Existing datasets (e.g., CelebV-Text) focus on annotations related to emotional changes and neglect human attributes and actions, making them unsuitable for identity-preserving video generation tasks.
3. **Influence of Non-Identity-Related Features in Reference Images**: Non-identity-related features in reference images can degrade the quality and identity preservation of the generated videos. Reducing the influence of these features requires new solutions to ensure fidelity in identity-specific video generation.

### Solutions

To address the above issues, the authors propose an efficient identity-specific video generation framework called ID-Animator. Specifically:

- **Zero-Shot Generation**: By pairing a pre-trained text-to-video diffusion model with a lightweight face adapter that encodes identity-related embeddings, ID-Animator can generate high-fidelity identity-specific human videos from a single reference facial image without further fine-tuning.
- **Identity-Oriented Dataset Construction Pipeline**: The pipeline builds on existing public datasets, applies unified captioning techniques to extract textual descriptions of human attributes and actions, and constructs a facial image pool to improve video generation quality.
- **Random Reference Training Strategy**: By randomly sampling faces from the facial image pool and optimizing an identity preservation objective, the model precisely extracts identity-related features, reduces the influence of non-identity-related features, and enhances identity fidelity and generalization in practical applications.

### Core Contributions

- **Proposing ID-Animator**: A framework capable of generating identity-specific videos given any reference facial image without further model fine-tuning. According to the authors, this is the first attempt to achieve zero-shot identity-specific human video generation.
- **Building an Identity-Oriented Dataset Construction Pipeline**: Unified captioning techniques and a facial image pool improve dataset quality, making the data more suitable for identity-preserving video generation tasks.
- **Designing a Random Reference Training Strategy**: By randomly sampling facial images and optimizing the identity preservation objective, the model reduces the influence of non-identity-related features, improving identity fidelity and generalization.

### Experimental Results

Experimental results show that ID-Animator outperforms existing methods on multiple metrics, including CLIP-I score, Dover score, motion score, and dynamic degree. Additionally, qualitative comparisons indicate that ID-Animator has higher identity fidelity and generalization capability in generating identity-specific videos.
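
The random reference training strategy described above can be pictured with a small toy sketch. This is an illustration only, not the authors' implementation. The summary does not state the exact form of the identity preservation objective, so the `1 - cosine similarity` loss below is an assumption. The names `face_pool`, `sample_reference`, and `identity_loss` are hypothetical, and the three-dimensional "embeddings" stand in for the output of a real face encoder.

```python
import math
import random


def sample_reference(face_pool, clip_id, rng):
    """Pick one face at random from the pool built for a clip.

    Drawing a different face each time, rather than always using a frame
    from the training video itself, is the idea behind random reference
    sampling.
    """
    return rng.choice(face_pool[clip_id])


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_loss(reference_embedding, generated_embedding):
    """Assumed identity objective: 0 when the embeddings point the same way."""
    return 1.0 - cosine_similarity(reference_embedding, generated_embedding)


# Toy face pool: each clip maps to several face embeddings of the same person.
face_pool = {
    "clip_001": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
}

rng = random.Random(0)  # seeded so the sketch is reproducible
reference = sample_reference(face_pool, "clip_001", rng)

# Placeholder for the face embedding extracted from a generated frame.
generated = [1.0, 0.0, 0.0]
loss = identity_loss(reference, generated)
print(f"identity loss: {loss:.4f}")
```

In a real training loop the sampled reference would condition the face adapter, and the loss would be computed on embeddings extracted from the model's output. Both of those steps are outside the scope of this sketch.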