StableAnimator: High-Quality Identity-Preserving Human Image Animation

Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu
2024-11-27
Abstract: Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference that strive for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with their respective off-the-shelf extractors; the face embeddings are then refined by interacting with the image embeddings through a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to maintain identity (ID) consistency when generating high-quality human image animations. Current diffusion models struggle to preserve identity, especially when the pose sequence contains large motion changes: the facial region is then prone to severe distortion and inconsistency, which destroys identity information. The paper therefore proposes StableAnimator, the first end-to-end identity-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses.

### Main contributions:

1. **Global Content-aware Face Encoder** and **Distribution-aware ID Adapter**: These two modules enable the video diffusion model to integrate facial embeddings without sacrificing video fidelity.
2. **Facial optimization method based on the Hamilton-Jacobi-Bellman (HJB) equation**: This method further enhances facial quality during conventional denoising. It is activated only at inference and does not require training any diffusion components. According to the authors, this is the first exploration of video diffusion for end-to-end identity-preserving human image animation.
3. **Experimental results**: Results on the benchmark datasets show that the model outperforms existing methods on multiple metrics, especially facial similarity (CSIM) and video fidelity (FVD).

### Method overview:

- **Training phase**:
  - **Global Content-aware Face Encoder**: Lets the facial embedding interact with the reference image embedding through multiple cross-attention blocks, so that the facial embedding becomes aware of the global context.
  - **Distribution-aware ID Adapter**: Aligns the refined facial embedding with the diffusion latents to avoid the feature distortion introduced by the temporal layers.
- **Inference stage**:
  - **Facial optimization based on the HJB equation**: Solving the HJB equation guides the direction of the denoising process, maximizing identity consistency and reducing detail distortion.

### Experimental results:

- **Quantitative results**: On the TikTok dataset and the Unseen100 dataset, StableAnimator significantly outperforms existing methods in facial similarity (CSIM) and video fidelity (FVD).
- **Qualitative results**: Visualizations show that StableAnimator generates animations that accurately follow the given pose sequence while keeping the reference identity consistent.

### Ablation study:

- **Impact of core components**: Removing the core components significantly reduces performance, especially on the face-related metric (CSIM).
- **Comparison of facial enhancement methods**: Compared with other facial enhancement methods, StableAnimator significantly improves facial quality while maintaining video fidelity.

Through these contributions, StableAnimator addresses the current challenges of maintaining identity consistency in human image animation and provides a new solution for high-quality animation generation.
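
As a rough illustration of the ideas summarized above, the following NumPy sketch shows (a) face tokens refined by cross-attending to image tokens, (b) one possible form of distribution alignment, and (c) a cosine similarity between two embeddings, which is the kind of score a face-similarity metric such as CSIM reports. This is not the authors' implementation. All function names, shapes, and weights are hypothetical, and matching per-channel mean and standard deviation is only an assumed reading of "alignment"; the summary does not specify the actual operation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(face_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: face tokens query the image tokens.

    Hypothetical stand-in for one block of a content-aware face encoder.
    """
    q = face_tokens @ w_q                      # (n_face, d)
    k = image_tokens @ w_k                     # (n_img, d)
    v = image_tokens @ w_v                     # (n_img, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return face_tokens + attn @ v              # residual connection

def align_distribution(source, target, eps=1e-6):
    """ASSUMPTION: align by matching per-channel mean and std of `target`."""
    src_mean, src_std = source.mean(axis=0), source.std(axis=0)
    tgt_mean, tgt_std = target.mean(axis=0), target.std(axis=0)
    normalized = (source - src_mean) / (src_std + eps)
    return normalized * tgt_std + tgt_mean

def csim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data: 4 face tokens, 16 image tokens, 32 latent tokens, 8 channels each.
rng = np.random.default_rng(0)
dim = 8
face_tokens = rng.normal(size=(4, dim))
image_tokens = rng.normal(size=(16, dim))
latents = rng.normal(loc=2.0, scale=3.0, size=(32, dim))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

refined = cross_attention(face_tokens, image_tokens, w_q, w_k, w_v)
aligned = align_distribution(refined, latents)
score = csim(face_tokens[0], refined[0])
print(refined.shape, aligned.shape, round(score, 3))
```

In this toy setup the aligned face features take on the per-channel statistics of the latents, which is the intuition behind keeping the two feature distributions compatible before they are combined.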