RealisDance: Equip controllable character animation with realistic hands

Jingkai Zhou, Benzhi Wang, Weihua Chen, Jingqi Bai, Dongyang Li, Aixi Zhang, Hao Xu, Mingyang Yang, Fan Wang

2024-09-10

Abstract: Controllable character animation is an emerging task that generates character videos controlled by pose sequences from given character images. Although character consistency has made significant progress via reference UNet, another crucial factor, pose control, has not been well studied by existing methods yet, resulting in several issues: 1) The generation may fail when the input pose sequence is corrupted. 2) The hands generated using the DWPose sequence are blurry and unrealistic. 3) The generated video will be shaky if the pose sequence is not smooth enough. In this paper, we present RealisDance to handle all the above issues. RealisDance adaptively leverages three types of poses, avoiding failed generation caused by corrupted pose sequences. Among these pose types, HaMeR provides accurate 3D and depth information of hands, enabling RealisDance to generate realistic hands even for complex gestures. Besides using temporal attention in the main UNet, RealisDance also inserts temporal attention into the pose guidance network, smoothing the video from the pose condition aspect. Moreover, we introduce pose shuffle augmentation during training to further improve generation robustness and video smoothness. Qualitative experiments demonstrate the superiority of RealisDance over other existing methods, especially in hand quality.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper targets three key challenges in controllable character animation:

1. **Unstable Generation**: When the input pose sequence is corrupted, for example by false detections in pose estimation, the output quality of existing methods declines significantly, and the generated video may fail or be unstable.
2. **Poor Hand Quality**: Hands generated from existing pose estimators (such as DWPose or OpenPose) are usually blurry and unrealistic. These pose representations lack 3D and depth information, which makes complex hand gestures difficult to generate.
3. **Video Jitter**: If the pose sequence is not smooth enough, the generated video jitters. Existing methods add temporal attention to the main UNet, but this alone is not sufficient to eliminate the jitter.

To address these problems, the paper proposes the RealisDance model. Its main improvements are:

- **Multi-type Pose Input**: Combines three types of pose input (DWPose, SMPL-CS, HaMeR) to improve the robustness of generation and the realism of hands.
- **Adaptive Pose Gating Module**: Fuses the three pose features through an adaptive gating layer, so that even if one pose sequence is corrupted, the other two can still drive correct generation.
- **Multi-layer Pose-guided Network**: Adds temporal attention not only in the main UNet but also in the pose guidance network, smoothing the video from the pose-condition side as well.
- **Pose Shuffling Augmentation**: Applies pose shuffle augmentation during training to make the model more robust to incorrect pose frames and to improve video smoothness.

According to the paper's qualitative experiments, these changes make RealisDance superior to existing methods in generation stability, hand quality, and video smoothness.
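The gating idea above can be illustrated with a small sketch. This is not the paper's implementation: the summary only says that three pose features are fused through an adaptive gating layer, so the softmax-over-streams form, the function name `gate_fuse`, and the parameters `gate_weights` and `gate_bias` are all assumptions chosen for illustration. Each pose stream (for example DWPose, SMPL-CS, HaMeR) gets a scalar score, the scores are normalized into gates, and the fused feature is the gate-weighted sum.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_fuse(
    features: Sequence[Sequence[float]],
    gate_weights: Sequence[Sequence[float]],
    gate_bias: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse K pose feature vectors with softmax gates (hypothetical form).

    features:     K vectors of length D, one per pose type.
    gate_weights: K vectors of length D; in a real model these would be learned.
    gate_bias:    K scalars, also learned in a real model.
    Returns (fused vector of length D, list of K gate values).
    """
    # One scalar score per stream: a linear projection of that stream's feature.
    scores = [
        sum(w * x for w, x in zip(weights, feat)) + bias
        for feat, weights, bias in zip(features, gate_weights, gate_bias)
    ]
    gates = softmax(scores)
    dim = len(features[0])
    # Gate-weighted sum across streams, computed per feature dimension.
    fused = [sum(g * feat[d] for g, feat in zip(gates, features)) for d in range(dim)]
    return fused, gates


if __name__ == "__main__":
    # Toy features; the third stream stands in for a corrupted pose sequence.
    feats = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
    zero_w = [[0.0, 0.0] for _ in feats]
    # A strongly negative bias mimics a gate that has learned to distrust stream 3.
    fused, gates = gate_fuse(feats, zero_w, [0.0, 0.0, -50.0])
    print("gates:", gates)
    print("fused:", fused)
```

In this toy setup, a gate that scores the corrupted stream very low leaves the fused feature dominated by the other two streams, which is the behavior the summary attributes to the adaptive gating module.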
