Do As I Do: Pose Guided Human Motion Copy

Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, Kui Ren
2024-06-24
Abstract: Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, and these shortcomings are easily perceived by human observers. Motivated by this, we tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches, with improvements of 7.2% in PSNR and 12.4% in FID.
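The Gromov-Wasserstein loss named in the abstract belongs to a family of discrepancies that compare two point sets through their internal pairwise distances, so the two sets may live in spaces of different dimension (for instance, pose keypoints and appearance features). The abstract does not give the paper's exact formulation, so the following is only a minimal sketch of the standard squared-loss Gromov-Wasserstein objective evaluated for a fixed coupling; it does not solve the minimization over couplings. The names `pairwise_distances` and `gw_objective`, and the random toy data, are illustrative assumptions.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def gw_objective(c1, c2, coupling):
    """Squared-loss Gromov-Wasserstein objective for a fixed coupling.

    Computes sum_{i,j,k,l} (c1[i,k] - c2[j,l])**2 * T[i,j] * T[k,l]
    by expanding the square, which avoids building a 4-index tensor.
    This is the textbook objective, not necessarily the paper's loss.
    """
    p = coupling.sum(axis=1)  # marginal over the first point set
    q = coupling.sum(axis=0)  # marginal over the second point set
    term1 = p @ (c1 ** 2) @ p
    term2 = q @ (c2 ** 2) @ q
    cross = np.sum((c1 @ coupling @ c2.T) * coupling)
    return term1 + term2 - 2.0 * cross


# Toy data (hypothetical): 5 points in 2-D versus 4 points in 3-D.
rng = np.random.default_rng(0)
pose_like = rng.normal(size=(5, 2))
appearance_like = rng.normal(size=(4, 3))

c_pose = pairwise_distances(pose_like)
c_app = pairwise_distances(appearance_like)

# Uniform product coupling: every pair of points is matched equally.
uniform_coupling = np.full((5, 4), 1.0 / (5 * 4))
value = gw_objective(c_pose, c_app, uniform_coupling)
print("GW objective under the uniform coupling:", value)
```

Because the objective depends only on within-set distances, it is unchanged when either point set is rotated or translated, which is one reason this kind of discrepancy suits comparisons across heterogeneous spaces.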
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses human motion copy: generating a fake video of a target person performing the actions of a source person. The task poses three main challenges:

1. **Detail Generation**: The generated target video must contain subtle human-body texture details, because human observers readily notice when they are missing.
2. **Temporal Consistency**: The generated video must remain temporally consistent, with no obvious discontinuities between adjacent frames.
3. **Insufficient Training Samples**: Existing methods usually rely on a large number of training samples, but in practice often only a limited number of target-person videos can be obtained.

To address these problems, the paper proposes improvements in three aspects:

1. **Perceptual Loss and Gromov-Wasserstein Loss**: A perceptual loss and a theoretically motivated Gromov-Wasserstein loss constrain pose-to-appearance generation. This bridges the gap between pose and appearance, reduces the dependence on large numbers of training samples, and yields more realistic results.
2. **Episodic Memory Module**: An episodic memory module in the pose-to-appearance generation process enables the model to keep learning from its past poor generations. In addition, geometric cues of the face are used to optimize facial details, and a dedicated local GAN refines each key body part.
3. **Sequence-to-Sequence Generation**: The foreground is generated in a sequence-to-sequence manner rather than frame by frame, which explicitly enforces temporal consistency in the generated video.

The method is evaluated on five datasets: iPER, ComplexMotion, SoloDance, Fish, and Mouse. The results show that it generates realistic target videos while accurately replicating the actions in the source videos. Compared with state-of-the-art approaches, it improves PSNR and FID by 7.2% and 12.4% respectively.
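For readers unfamiliar with the PSNR metric cited in the results, the sketch below shows its standard definition, together with one plausible way of expressing a percentage improvement. The text does not state how the 7.2% and 12.4% figures were derived, so treating them as relative changes is an assumption, as are the function names and the example numbers (which are not taken from the paper). Note that higher PSNR is better, while lower FID is better.

```python
import numpy as np


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


def relative_improvement(baseline, ours, higher_is_better=True):
    """Relative change versus a baseline, signed so positive means better.

    Assumption: improvements are reported relative to the baseline score.
    """
    change = (ours - baseline) / abs(baseline)
    return change if higher_is_better else -change


# Illustrative numbers only (not results from the paper).
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64, 3))
noisy = np.clip(frame + rng.normal(scale=5.0, size=frame.shape), 0, 255)

print("PSNR of the noisy frame (dB):", psnr(frame, noisy))
print("PSNR gain:", relative_improvement(20.0, 22.0))
print("FID gain:", relative_improvement(50.0, 40.0, higher_is_better=False))
```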