Abstract: Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or usually missing identity details in the video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline that incorporates unified human attributes and action captioning techniques from a constructed facial image pool. Based on this pipeline, a random reference training strategy is further devised to precisely capture the ID-relevant embeddings with an ID-preserving loss, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator in generating personalized human videos over previous models. Moreover, our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models, showing high extendability in real-world applications for video generation where identity preservation is highly desired. Our code and checkpoints are released at https://github.com/ID-Animator/ID-Animator.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address several key challenges in generating high-fidelity, identity-specific human videos:

1. **High Training and Fine-tuning Costs**: Many existing identity-specific customization methods require cumbersome per-identity fine-tuning. This adds significant training overhead for every new identity and hinders the widespread application and scalability of these techniques.
2. **Lack of High-Quality Text-Conditioned Human Video Datasets**: Compared with the image generation community (e.g., LAION-Face), the video generation community lacks sufficient high-quality text-video data pairs, especially for human videos. Existing datasets (e.g., CelebV-Text) focus on annotations related to emotional changes and neglect human attributes and actions, which makes them unsuitable for identity-preserving video generation tasks.
3. **Influence of Non-Identity-Related Features in Reference Images**: Non-identity-related features in reference images can degrade the quality and identity preservation of the generated videos. Reducing their influence requires new solutions to ensure fidelity in identity-specific video generation.

### Solutions

To address the above issues, the authors propose an efficient identity-specific video generation framework called ID-Animator. Specifically:

- **Zero-Shot Generation**: By pairing a pre-trained text-to-video diffusion model with a lightweight face adapter that encodes identity-related embeddings, ID-Animator can generate high-fidelity identity-specific human videos from a single reference facial image without further fine-tuning.
- **Identity-Oriented Dataset Construction Pipeline**: The pipeline builds on existing public datasets, applies unified caption generation techniques to extract textual descriptions of human attributes and actions, and constructs a facial image pool to improve video generation quality.
- **Random Reference Training Strategy**: By randomly sampling faces from the facial image pool and optimizing an identity preservation objective, the model precisely extracts identity-related features, reduces the influence of non-identity-related features, and improves identity fidelity and generalization in practical applications.

### Core Contributions

- **Proposing ID-Animator**: A framework capable of generating identity-specific videos from any reference facial image without further model fine-tuning. The authors describe it as the first attempt at zero-shot identity-specific human video generation.
- **Building an Identity-Oriented Dataset Construction Pipeline**: Unified caption techniques and a facial image pool improve dataset quality and make the data more suitable for identity-preserving video generation tasks.
- **Designing a Random Reference Training Strategy**: By randomly sampling facial images and optimizing the identity preservation objective, the model reduces the influence of non-identity-related features, improving identity fidelity and generalization.

### Experimental Results

Experimental results show that ID-Animator outperforms existing methods on multiple metrics, including CLIP-I score, DOVER score, motion score, and dynamic degree. Additionally, qualitative comparisons indicate that ID-Animator achieves higher identity fidelity and generalization capability when generating identity-specific videos.
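
The random reference idea summarized above (drawing the conditioning face at random from a facial image pool) can be illustrated with a small sketch. This is not the authors' code. It assumes, hypothetically, that the pool is organized per video clip and that faces are identified by file name; the names `FacePool`, `sample_training_pair`, and `build_epoch` are invented for illustration, and the identity-preserving loss itself is not modeled.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical layout (an assumption, not from the paper): each clip ID maps
# to the pool of face crops extracted from that clip.
FacePool = Dict[str, List[str]]


def sample_training_pair(
    face_pool: FacePool, clip_id: str, rng: random.Random
) -> Tuple[str, str]:
    """Pair a clip with a reference face drawn at random from its pool.

    Drawing a different face each time means the reference rarely matches
    the target frames exactly, so pose, lighting, and background in the
    reference are less useful shortcuts than identity itself.
    """
    faces = face_pool[clip_id]
    if not faces:
        raise ValueError(f"no faces in pool for clip {clip_id!r}")
    return clip_id, rng.choice(faces)


def build_epoch(face_pool: FacePool, rng: random.Random) -> List[Tuple[str, str]]:
    """Build one epoch of (clip, reference face) pairs in shuffled order."""
    clip_ids = sorted(face_pool)  # sort first so a seeded RNG is reproducible
    rng.shuffle(clip_ids)
    return [sample_training_pair(face_pool, cid, rng) for cid in clip_ids]


if __name__ == "__main__":
    # Placeholder file names; a real pool would hold image paths or tensors.
    demo_pool = {
        "clip_a": ["a_face_0.png", "a_face_1.png", "a_face_2.png"],
        "clip_b": ["b_face_0.png", "b_face_1.png"],
    }
    for clip, face in build_epoch(demo_pool, random.Random(0)):
        print(clip, face)
```

Calling `build_epoch` again with a different seed yields a different reference face per clip, which is the property the strategy relies on.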

ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

Ivs-Net: Learning Human View Synthesis from Internet Videos

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

FaceChain: A Playground for Identity-Preserving Portrait Generation

StableAnimator: High-Quality Identity-Preserving Human Image Animation

MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

InstantID: Zero-shot Identity-Preserving Generation in Seconds

StableIdentity: Inserting Anybody into Anywhere at First Sight

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Unified Video and Image Representation for Boosted Video Face Forgery Detection

Magic-Me: Identity-Specific Video Customized Diffusion

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Zero-Shot Face Swapping with De-identification Adversarial Learning

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Zero-shot High-fidelity and Pose-controllable Character Animation

Image-to-Video Generation via 3D Facial Dynamics

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Fine-gained Zero-shot Video Sampling

Seeing is not Believing: An Identity Hider for Human Vision Privacy Protection