UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang

2024-06-03

Abstract: Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, there are still two limitations: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and model parameters; ii) the generated video is usually short in time (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present a UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports random noised input as well as first frame conditioned input, which enhances the ability to generate long-term video. Finally, to further efficiently handle long sequences, we explore an alternative temporal modeling architecture based on state space model to replace the original computation-consuming temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first frame conditioning strategy. Code and models will be publicly available. Project page: https://unianimate.github.io/
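The iterative first frame conditioning mentioned in the abstract can be pictured as a segment schedule. The sketch below is only an illustration, not the authors' code: the function name `plan_segments`, the segment length, and the choice to reuse the last frame of one segment as the first frame condition of the next are all assumptions made here for clarity.

```python
from typing import List, Tuple


def plan_segments(total_frames: int, segment_len: int) -> List[Tuple[int, int, bool]]:
    """Plan segments for a long video (hypothetical helper, not from the paper).

    Returns a list of (start, end, first_frame_conditioned) tuples, where
    `end` is exclusive. The first segment starts from random noise; every
    later segment is assumed to be conditioned on the last frame of the
    previous segment, so consecutive segments overlap by exactly one frame.
    """
    if total_frames < 1:
        raise ValueError("total_frames must be at least 1")
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2 so each segment adds a new frame")

    segments = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        # Only the very first segment is generated without a first frame condition.
        segments.append((start, end, start > 0))
        if end >= total_frames:
            break
        # Assumption: the last generated frame becomes the next segment's first frame.
        start = end - 1
    return segments


if __name__ == "__main__":
    for seg in plan_segments(total_frames=10, segment_len=4):
        print(seg)
```

Under these assumptions, each later segment contributes `segment_len - 1` new frames, so the number of segments grows linearly with the target video length.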

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper proposes improvements to address two key issues in the field of human image animation synthesis:

1. **Need for an additional reference model**: Existing methods require an additional reference model to align the identity image with the video branch, which not only increases the optimization difficulty but also significantly increases the number of model parameters.
2. **Limited generated video length**: Existing methods typically can only generate short videos (e.g., 24 frames), limiting the potential for practical applications. This is because the computational complexity of the temporal Transformer used is quadratic in the temporal dimension, thus limiting the length of the generated video.

To address the above issues, the paper proposes a framework called **UniAnimate**, whose core contributions include:

- **Unified video diffusion model**: By mapping the reference image, pose guidance, and noise video into a common feature space, the optimization process is simplified and temporal coherence is ensured.
- **Unified noise input design**: Supports random noise input as well as input based on the first frame condition, enhancing the ability to generate long-duration videos and ensuring smooth transitions between videos through the first frame condition strategy.
- **Alternative temporal modeling architecture**: Adopts a state-space model-based approach to replace the original temporal Transformer, reducing computational costs and improving efficiency in handling long sequences.

Experimental results show that UniAnimate outperforms existing state-of-the-art methods in both quantitative and qualitative evaluations and is capable of generating high-quality, coherent human animation videos up to 1 minute long. Additionally, user studies further validate the superior performance of this method in terms of visual quality, identity preservation, and temporal consistency.
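To make the quadratic-versus-linear point above concrete, here is a minimal sketch that contrasts the number of frame pairs a temporal self-attention layer compares with a single pass of a toy linear state-space recurrence. It is not the architecture used in the paper: the scalar parameters `a`, `b`, `c` and both function names are hypothetical, chosen only to show how the cost scales with the number of frames.

```python
from typing import List


def attention_pair_count(num_frames: int) -> int:
    """Frame-to-frame comparisons in full temporal self-attention: T * T."""
    return num_frames * num_frames


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Toy scalar state-space recurrence (illustrative parameters only).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Each frame is visited once, so the cost is linear in the sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # update the hidden state from the previous state
        ys.append(c * h)   # read out the current output
    return ys


if __name__ == "__main__":
    for t in (24, 48, 96):
        # Doubling the frame count quadruples attention pairs but only doubles scan steps.
        print(t, attention_pair_count(t), len(ssm_scan([0.0] * t)))
```

Doubling the frame count from 24 to 48 raises the attention pair count from 576 to 2304, while the recurrence only goes from 24 to 48 steps, which is the scaling argument behind replacing the temporal Transformer for long sequences.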

OAW-GAN: Occlusion-Aware Warping GAN for Unified Human Video Synthesis

Ivs-Net: Learning Human View Synthesis from Internet Videos

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

AnimateAnything: Consistent and Controllable Animation for Video Generation

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

Controllable Longer Image Animation with Diffusion Models

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

High Quality Human Image Animation using Regional Supervision and Motion Blur Condition

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing

UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing

Image-to-Video Generation via 3D Facial Dynamics

ID-Animator: Zero-Shot Identity-Preserving Human Video Generation