Do As I Do: Pose Guided Human Motion Copy

Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, Kui Ren

2024-06-24

Abstract: Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, and these shortcomings are easily perceived by human observers. Motivated by this, we try to tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches and gains 7.2% and 12.4% improvements in PSNR and FID respectively.
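
The Gromov-Wasserstein loss named in aspect (1) belongs to a family of discrepancies that compare the pairwise-distance structure inside two spaces instead of measuring distances across them, which is why it suits quantities as different as poses and appearances. The following is a minimal sketch of the squared-loss Gromov-Wasserstein cost for a fixed coupling, not the paper's implementation: the function names `pairwise_distances` and `gw_cost`, the toy 2D keypoints, and the hand-picked couplings are all assumptions made for illustration, and the abstract does not say how the coupling is optimized or which features are compared.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix of a point set with shape (n, d)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def gw_cost(c_src, c_tgt, coupling):
    """Squared-loss Gromov-Wasserstein cost for a FIXED coupling.

    Computes sum_{i,j,k,l} (c_src[i,k] - c_tgt[j,l])**2 * T[i,j] * T[k,l].
    A full GW distance would also minimize over the coupling T;
    that optimization is deliberately left out of this sketch.
    """
    # diff[i, j, k, l] = c_src[i, k] - c_tgt[j, l]
    diff = c_src[:, None, :, None] - c_tgt[None, :, None, :]
    # weights[i, j, k, l] = T[i, j] * T[k, l]
    weights = coupling[:, :, None, None] * coupling[None, None, :, :]
    return float((diff ** 2 * weights).sum())


# Toy data (hypothetical): five 2D keypoints and a rotated, shifted copy.
rng = np.random.default_rng(0)
pose = rng.normal(size=(5, 2))
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
moved = pose @ rotation.T + 3.0

c_pose = pairwise_distances(pose)
c_moved = pairwise_distances(moved)

# Coupling that matches point i to point i, and an uninformative uniform one.
identity_coupling = np.eye(5) / 5
uniform_coupling = np.full((5, 5), 1.0 / 25)

aligned = gw_cost(c_pose, c_moved, identity_coupling)
unaligned = gw_cost(c_pose, c_moved, uniform_coupling)
print(aligned, unaligned)
```

Because rotation and translation preserve pairwise distances, the correctly matched coupling yields a cost of essentially zero, while the uniform coupling does not. The 4D tensor used here costs O(n^2 m^2) memory, so it is only practical for small point sets.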

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses human motion copy: generating a fake video of a target person performing the actions of a source person. The task poses three challenges:

1. **Detail Generation**: The generated target video must contain subtle human-body texture details, whose absence human observers readily notice.
2. **Temporal Consistency**: The generated video must remain temporally consistent, with no obvious discontinuities between adjacent frames.
3. **Insufficient Training Samples**: Existing methods usually rely on a large number of training samples, but in practice only a limited number of target-person videos are often available.

To address these challenges, the paper proposes improvements in three areas:

1. **Perceptual Loss and Gromov-Wasserstein Loss**: A perceptual loss and a theoretically motivated Gromov-Wasserstein loss bridge the gap between pose and appearance, reducing the dependence on large training sets and yielding more realistic results.
2. **Episodic Memory Module**: An episodic memory module in the pose-to-appearance generation process lets the model keep learning from its past poor generations. In addition, geometric cues of the face are used to optimize facial details, and a dedicated local GAN refines each key body part.
3. **Sequence-to-Sequence Generation**: The foreground is generated in a sequence-to-sequence manner rather than frame by frame, which explicitly enforces temporal consistency in the generated video.

The method is evaluated on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse). The results show that it can generate realistic target videos while accurately replicating the actions in the source videos. Compared with existing methods, it reports improvements of 7.2% in PSNR and 12.4% in FID.
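
For readers unfamiliar with the first of the two reported metrics, PSNR (peak signal-to-noise ratio) is a log-scaled function of the mean squared error between a reference frame and a generated frame, where higher is better; FID, by contrast, is a distance for which lower is better. The sketch below shows the standard PSNR definition under the assumption of 8-bit frames (peak value 255). The function name `psnr` and the toy frames are hypothetical, and the paper's exact evaluation protocol is not described in this summary.

```python
import numpy as np


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in decibels between two equally sized frames."""
    # Cast to float first so uint8 subtraction cannot wrap around.
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy frames (hypothetical): a flat gray image and a copy that is off by one level.
frame = np.full((4, 4, 3), 128, dtype=np.uint8)
off_by_one = np.full((4, 4, 3), 129, dtype=np.uint8)

print(psnr(frame, frame))       # identical frames give infinity
print(psnr(frame, off_by_one))  # MSE of 1 gives 10 * log10(255**2)
```

Because PSNR is logarithmic, a relative gain such as the reported 7.2% does not translate into a proportional reduction in pixel error.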

OAW-GAN: Occlusion-Aware Warping GAN for Unified Human Video Synthesis

Copy Motion from One to Another: Fake Motion Video Generation

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Pose Guided Human Video Generation

Human Motion Transfer from Poses in the Wild

Human Motion Transfer With 3D Constraints and Detail Enhancement

Audio-driven Talking Face Video Generation with Natural Head Pose

Supervised Video-To-Video Synthesis For Single Human Pose Transfer

Image Comes Dancing With Collaborative Parsing-Flow Video Synthesis

Restore DeepFakes Video Frames Via Identifying Individual Motion Styles

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Action2video: Generating Videos of Human 3D Actions

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

Pose-Guided Fine-Grained Sign Language Video Generation

Poxture: Human Posture Imitation Using Neural Texture

Realistic Speech-Driven Talking Video Generation with Personalized Pose

Talking-head Generation with Rhythmic Head Motion

Video-based Characters

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation