Abstract: Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision that aims to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging because subtle human-body texture details must be generated and temporal consistency must be maintained. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically requires a large number of training samples that are difficult to acquire. Meanwhile, current methods still struggle to attain realistic image details and temporal consistency, and the resulting artifacts are easily perceived by human observers. Motivated by this, we tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches, with improvements of 7.2% and 12.4% in PSNR and FID, respectively.
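To make the idea behind the Gromov-Wasserstein loss in aspect (1) concrete, the following is a minimal sketch of the standard discrete squared-loss Gromov-Wasserstein objective for a fixed coupling. The abstract does not give the exact formulation used in the paper, so this is an illustration under assumptions rather than the authors' implementation; the function names `pairwise_dist` and `gw_objective` and the toy point sets are hypothetical. The objective compares distances within each domain rather than across domains, which is why it can relate pose and appearance features that do not live in the same space.

```python
import math
from itertools import product


def pairwise_dist(points):
    """Euclidean distance matrix between all points of a single domain."""
    return [[math.dist(p, q) for q in points] for p in points]


def gw_objective(c1, c2, coupling):
    """Squared-loss Gromov-Wasserstein objective for a fixed coupling.

    Sums (c1[i][k] - c2[j][l])**2 * coupling[i][j] * coupling[k][l]
    over all index tuples (i, j, k, l). Assumption: this is the textbook
    discrete form, not necessarily the loss used in the paper.
    """
    n, m = len(c1), len(c2)
    total = 0.0
    for i, j, k, l in product(range(n), range(m), range(n), range(m)):
        total += (c1[i][k] - c2[j][l]) ** 2 * coupling[i][j] * coupling[k][l]
    return total


# Hypothetical toy data: a 2-D "pose" set and a 3-D "appearance" set
# that have identical internal distances (the second is a rotated,
# embedded copy of the first).
pose = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
appearance = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-2.0, 0.0, 0.0)]

c_pose = pairwise_dist(pose)
c_app = pairwise_dist(appearance)

# A coupling that matches point i to point i, each with mass 1/3.
matched = [[1 / 3 if i == j else 0.0 for j in range(3)] for i in range(3)]
# A coupling that spreads mass uniformly and ignores structure.
uniform = [[1 / 9 for _ in range(3)] for _ in range(3)]

matched_cost = gw_objective(c_pose, c_app, matched)
uniform_cost = gw_objective(c_pose, c_app, uniform)
```

In this toy setting the structure-preserving coupling yields a cost of essentially zero even though the two point sets have different dimensionality, while the uniform coupling is penalized. In a training setup, a term of this kind would typically be computed on learned features and combined with the adversarial and perceptual terms.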

Multi-person/Group Interactive Video Generation

CoMA: Compositional Human Motion Generation with Multi-modal Agents

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Learning a Generative Model for Multi-Step Human-Object Interactions from Videos

AMG: Avatar Motion Guided Video Generation

ViMo: Generating Motions from Casual Videos

Action2video: Generating Videos of Human 3D Actions

Multi-Frame Content Integration with a Spatio-Temporal Attention Mechanism for Person Video Motion Transfer

InterGen: Diffusion-Based Multi-human Motion Generation Under Complex Interactions

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Composable Semi-parametric Modelling for Long-range Motion Generation

GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models

Do as I Do: Pose Guided Human Motion Copy

Audio-Driven Co-Speech Gesture Video Generation

Human Motion Generation: A Survey

Deep Gesture Video Generation with Learning on Regions of Interest