Abstract: Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively, and the face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.

What problem does this paper attempt to address?

The problem this paper addresses is how to maintain identity (ID) consistency when generating high-quality human image animations. Current diffusion models struggle with this, especially for pose sequences with large motion changes, where the facial region is prone to severe distortion and inconsistency that destroys identity information. The paper therefore proposes StableAnimator, the first end-to-end identity-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses.

### Main contributions:

1. **Global Content-aware Face Encoder** and **Distribution-aware ID Adapter**: These two modules enable the video diffusion model to integrate facial embeddings without sacrificing video fidelity.
2. **Facial optimization method based on the Hamilton-Jacobi-Bellman (HJB) equation**: This method further enhances facial quality alongside conventional denoising. It is activated only at inference and does not require training any diffusion components. The authors state that, to the best of their knowledge, this is the first exploration of video diffusion for end-to-end identity-preserving human image animation.
3. **Experimental results**: Results on the benchmark datasets show that the model outperforms existing methods on multiple metrics, especially facial similarity (CSIM) and video fidelity (FVD).

### Method overview:

- **Training phase**:
  - **Global Content-aware Face Encoder**: Lets the facial embedding interact with the reference image embedding through multiple cross-attention blocks, making the facial embedding aware of the global context.
  - **Distribution-aware ID Adapter**: Aligns the refined facial embedding with the diffusion latents to avoid the feature distortion introduced by the temporal layers.
- **Inference stage**:
  - **Facial optimization based on the HJB equation**: Solves the HJB equation to guide the direction of the denoising process, maximizing identity consistency and reducing distortion of details.

### Experimental results:

- **Quantitative results**: Results on the TikTok dataset and the Unseen100 dataset show that StableAnimator significantly outperforms existing methods in facial similarity (CSIM) and video fidelity (FVD).
- **Qualitative results**: Visualizations show that StableAnimator accurately generates animations that follow the given pose sequence while keeping the reference identity consistent.

### Ablation study:

- **Impact of core components**: Removing the core components significantly degrades performance, particularly on the face-related metric (CSIM).
- **Comparison of facial enhancement methods**: Compared with other facial enhancement methods, StableAnimator significantly improves facial quality while maintaining video fidelity.

Through these contributions, StableAnimator addresses the current challenge of maintaining identity consistency in human image animation and offers a new approach to high-quality animation generation.
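
The summary describes the Distribution-aware ID Adapter only at a high level. As a rough illustration, and not the authors' implementation, one common way to align two feature distributions is to match their mean and standard deviation. The sketch below assumes that reading. The names `align_distribution` and `cosine_similarity`, along with the toy values, are hypothetical and chosen purely for illustration. The cosine-similarity helper reflects how face-similarity scores such as CSIM are typically computed between identity embeddings.

```python
import math
from statistics import fmean, pstdev


def align_distribution(face_feats, target_feats, eps=1e-6):
    """Shift and scale face_feats so its mean/std match those of target_feats.

    Assumption: "alignment" is read here as simple moment matching; the
    paper's actual adapter may differ.
    """
    mu_f, sd_f = fmean(face_feats), pstdev(face_feats)
    mu_t, sd_t = fmean(target_feats), pstdev(target_feats)
    # Normalize the face features, then re-express them in the target statistics.
    return [(x - mu_f) / (sd_f + eps) * sd_t + mu_t for x in face_feats]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 1-D "features"; real features would be high-dimensional tensors.
    face = [0.0, 2.0, 4.0]
    target = [10.0, 11.0, 12.0]
    aligned = align_distribution(face, target)
    print("aligned mean:", fmean(aligned))
    print("aligned std: ", pstdev(aligned))
    print("similarity:  ", cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Matching only the first two moments keeps the relative ordering of the face features intact while bringing their scale in line with the target, which is one simple way to keep a newly injected signal from drifting away from the features it is added to.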

StableAnimator: High-Quality Identity-Preserving Human Image Animation

ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

StableIdentity: Inserting Anybody into Anywhere at First Sight

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

HiFiVFS: High Fidelity Video Face Swapping

High Quality Human Image Animation using Regional Supervision and Motion Blur Condition

AniFaceDiff: Animating Stylized Avatars via Parametric Conditioned Diffusion Models

HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation

Dormant: Defending against Pose-driven Human Image Animation

MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention