UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang
2024-06-03
Abstract: Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, there are still two limitations: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and model parameters; ii) the generated video is usually short in time (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present a UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports random noised input as well as first frame conditioned input, which enhances the ability to generate long-term video. Finally, to further efficiently handle long sequences, we explore an alternative temporal modeling architecture based on state space model to replace the original computation-consuming temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first frame conditioning strategy. Code and models will be publicly available. Project page: https://unianimate.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes improvements to address two key issues in the field of human image animation synthesis:

1. **Need for an additional reference model**: Existing methods require an additional reference model to align the identity image with the video branch, which not only increases the optimization difficulty but also significantly increases the number of model parameters.
2. **Limited generated video length**: Existing methods typically can only generate short videos (e.g., 24 frames), limiting the potential for practical applications. This is because the computational complexity of the temporal Transformer used is quadratic in the temporal dimension, thus limiting the length of the generated video.

To address the above issues, the paper proposes a framework called **UniAnimate**, whose core contributions include:

- **Unified video diffusion model**: By mapping the reference image, pose guidance, and noise video into a common feature space, the optimization process is simplified and temporal coherence is ensured.
- **Unified noise input design**: Supports random noise input as well as input based on the first frame condition, enhancing the ability to generate long-duration videos and ensuring smooth transitions between videos through the first frame condition strategy.
- **Alternative temporal modeling architecture**: Adopts a state-space model-based approach to replace the original temporal Transformer, reducing computational costs and improving efficiency in handling long sequences.

Experimental results show that UniAnimate outperforms existing state-of-the-art methods in both quantitative and qualitative evaluations and is capable of generating high-quality, coherent human animation videos up to 1 minute long. Additionally, user studies further validate the superior performance of this method in terms of visual quality, identity preservation, and temporal consistency.
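The iterative first-frame conditioning strategy mentioned above can be illustrated with a small Python sketch. This is not the authors' code: `generate_long_video` and `dummy_generate_segment` are hypothetical names, frames and poses are placeholder values, and the one-frame overlap between consecutive segments is an assumption about how the segments are stitched. The sketch only shows the control flow: the first segment starts from random noise (no conditioning frame), and each later segment is conditioned on the last frame generated so far.

```python
from typing import Callable, List, Optional, Sequence

Frame = str  # placeholder; a real frame would be an image tensor
Pose = int   # placeholder; a real pose would be a skeleton map

SegmentFn = Callable[[Sequence[Pose], Optional[Frame]], List[Frame]]


def dummy_generate_segment(poses: Sequence[Pose],
                           first_frame: Optional[Frame]) -> List[Frame]:
    """Hypothetical stand-in for a video diffusion sampler.

    Returns one frame per pose. When first_frame is given, it is passed
    through as the first output frame to mimic first-frame conditioning.
    """
    frames = [f"frame(pose={p})" for p in poses]
    if first_frame is not None:
        frames[0] = first_frame
    return frames


def generate_long_video(poses: Sequence[Pose],
                        segment_len: int,
                        generate_segment: SegmentFn = dummy_generate_segment
                        ) -> List[Frame]:
    """Generate a long video segment by segment (assumed stitching scheme)."""
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2 so each segment adds new frames")
    video: List[Frame] = []
    while len(video) < len(poses):
        if not video:
            # First segment: no conditioning frame (random-noise input).
            video.extend(generate_segment(poses[:segment_len], None))
        else:
            # Later segments: restart at the pose of the last generated frame
            # and condition on that frame, then drop the duplicated frame.
            begin = len(video) - 1
            frames = generate_segment(poses[begin:begin + segment_len], video[-1])
            video.extend(frames[1:])
    return video


if __name__ == "__main__":
    # 10 poses with segments of 4 frames: three sampler calls in total.
    print(generate_long_video(list(range(10)), segment_len=4))
```

In this sketch each segment after the first contributes `segment_len - 1` new frames, so the number of sampler calls grows linearly with the target video length while each call only ever processes a short clip.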