RealisDance: Equip controllable character animation with realistic hands

Jingkai Zhou, Benzhi Wang, Weihua Chen, Jingqi Bai, Dongyang Li, Aixi Zhang, Hao Xu, Mingyang Yang, Fan Wang
2024-09-10
Abstract: Controllable character animation is an emerging task that generates character videos controlled by pose sequences from given character images. Although character consistency has made significant progress via reference UNet, another crucial factor, pose control, has not been well studied by existing methods yet, resulting in several issues: 1) The generation may fail when the input pose sequence is corrupted. 2) The hands generated using the DWPose sequence are blurry and unrealistic. 3) The generated video will be shaky if the pose sequence is not smooth enough. In this paper, we present RealisDance to handle all the above issues. RealisDance adaptively leverages three types of poses, avoiding failed generation caused by corrupted pose sequences. Among these pose types, HaMeR provides accurate 3D and depth information of hands, enabling RealisDance to generate realistic hands even for complex gestures. Besides using temporal attention in the main UNet, RealisDance also inserts temporal attention into the pose guidance network, smoothing the video from the pose condition aspect. Moreover, we introduce pose shuffle augmentation during training to further improve generation robustness and video smoothness. Qualitative experiments demonstrate the superiority of RealisDance over other existing methods, especially in hand quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three pose-control problems in controllable character animation:

1. **Unstable Generation**: When the input pose sequence is corrupted, for example by false detections in pose estimation, the output of existing methods degrades significantly and generation may fail outright.
2. **Poor Hand Quality**: Hands generated from existing pose estimates (such as DWPose or OpenPose) are usually blurry and unrealistic. These poses lack 3D and depth information, which makes complex hand gestures hard to generate.
3. **Video Jitter**: If the pose sequence is not smooth enough, the generated video jitters. Existing methods already add temporal attention to the main UNet, but that alone does not fully remove the jitter.

To solve these problems, the paper proposes the RealisDance model. Its main improvements are:

- **Multi-type Pose Input**: Combines three types of pose input (DWPose, SMPL-CS, HaMeR) to improve generation robustness and hand realism.
- **Adaptive Pose Gating Module**: Fuses the three pose features through an adaptive gating layer, so that even if one pose sequence is corrupted, the other two can still drive correct generation.
- **Multi-layer Pose-guided Network**: Inserts temporal attention into the pose guidance network as well as the main UNet, so the video is also smoothed on the pose-condition side.
- **Pose Shuffling Augmentation**: Applies pose shuffle augmentation during training to make the model more robust to incorrect pose frames and to improve video smoothness.

The paper reports, based on qualitative experiments, that these changes make RealisDance better than existing methods in generation stability, hand quality, and video smoothness, with the clearest gains in hand quality.
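
Two of the ideas above, gated fusion of several pose features and pose shuffling during training, can be illustrated with a small sketch. This is not the paper's implementation. It is a toy version under stated assumptions: features are plain lists of floats, each pose type gets a single scalar gate logit (in a real model the gate would presumably be predicted from the pose features by a learned layer), and "pose shuffle" is read as randomly permuting pose frames inside short temporal windows. The names `gated_fusion`, `shuffle_pose_frames`, `window`, and `prob` are hypothetical and do not come from the source.

```python
import math
import random


def sigmoid(x):
    """Logistic function, used here to squash a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(features, gate_logits):
    """Sum per-pose feature vectors, each scaled by its own sigmoid gate.

    features:    dict mapping pose name -> list of floats (same length each)
    gate_logits: dict mapping pose name -> float (assumed: one scalar gate
                 per pose type; a strongly negative logit mutes that pose)
    """
    names = list(features)
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name in names:
        gate = sigmoid(gate_logits[name])
        for i, value in enumerate(features[name]):
            fused[i] += gate * value
    return fused


def shuffle_pose_frames(frames, window=2, prob=0.5, rng=None):
    """Return a copy of `frames` with some windows randomly permuted.

    Assumed reading of "pose shuffle": frames only move within their own
    window of `window` consecutive frames, and each window is shuffled
    with probability `prob`. The input list is not modified.
    """
    rng = rng or random.Random()
    out = list(frames)
    for start in range(0, len(out), window):
        if rng.random() < prob:
            chunk = out[start:start + window]
            rng.shuffle(chunk)
            out[start:start + window] = chunk
    return out


if __name__ == "__main__":
    # A "corrupted" HaMeR feature is suppressed by a very negative gate logit.
    feats = {"dwpose": [1.0, 1.0], "smpl_cs": [2.0, 2.0], "hamer": [100.0, 100.0]}
    logits = {"dwpose": 0.0, "smpl_cs": 0.0, "hamer": -50.0}
    print(gated_fusion(feats, logits))

    # Pose frames are represented by their indices for illustration.
    print(shuffle_pose_frames(list(range(8)), window=4, prob=1.0))
```

In this sketch the gate is what lets the two intact pose types dominate when the third is unreliable, and the windowed shuffle exposes the model to locally disordered pose frames while leaving the overall motion intact. Both choices are simplifications made for illustration.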