MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou

2024-06-28

Abstract: In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion

Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia

What problem does this paper attempt to address?

The paper addresses three key challenges in video generation: controllability, video length, and richness of detail. The authors propose MimicMotion, a controllable video generation framework that produces high-quality videos of arbitrary length from specific motion guidance. Compared with existing methods, MimicMotion has the following highlights:

1. **Confidence-Aware Pose Guidance**: Confidence scores are introduced alongside the pose sequence, which improves frame quality and temporal smoothness. This reduces image distortion and mitigates the negative impact of inaccurate pose estimation during both training and inference.
2. **Regional Loss Amplification Based on Pose Confidence**: The loss is amplified in regions with high pose confidence, such as hands. This significantly reduces image distortion and makes these regions clearer and more accurate.
3. **Progressive Latent Fusion Strategy**: To generate long and smooth videos, the method generates video segments with overlapping frames and then merges these segments. This yields videos of arbitrary length while keeping resource consumption acceptable.

In summary, MimicMotion aims to generate long videos while maintaining high quality and temporal coherence, and it targets human motion video generation driven by pose guidance. It also addresses the image distortion and blurry hand details seen in previous methods and improves the temporal smoothness of the video.
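The second and third highlights can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `threshold` and `scale` defaults, and the linear cross-fade weights are all assumptions made for illustration, since the summary does not specify them. Frames are plain lists of floats standing in for per-frame latents, and the merge is shown on finished segments only.

```python
from typing import List

Frame = List[float]  # hypothetical stand-in for one frame's latent vector


def amplified_loss(errors: List[float], confidences: List[float],
                   threshold: float = 0.8, scale: float = 10.0) -> float:
    """Mean loss with elements in high-confidence regions up-weighted.

    `threshold` and `scale` are illustrative values, not taken from the paper.
    """
    if len(errors) != len(confidences):
        raise ValueError("errors and confidences must have the same length")
    if not errors:
        return 0.0
    # Elements whose pose confidence exceeds the threshold count `scale` times.
    weights = [scale if c > threshold else 1.0 for c in confidences]
    return sum(w * e for w, e in zip(weights, errors)) / len(errors)


def fuse_segments(segments: List[List[Frame]], overlap: int) -> List[Frame]:
    """Merge consecutive segments that share `overlap` frames.

    Overlapping frames are blended with a linear cross-fade (an assumption);
    non-overlapping frames are copied through unchanged.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    if not segments:
        return []
    fused = [list(f) for f in segments[0]]
    for seg in segments[1:]:
        if overlap > min(len(fused), len(seg)):
            raise ValueError("overlap exceeds segment length")
        start = len(fused) - overlap
        for i in range(overlap):
            # The incoming segment's weight rises across the overlap.
            w = (i + 1) / (overlap + 1)
            prev, cur = fused[start + i], seg[i]
            fused[start + i] = [(1 - w) * p + w * c for p, c in zip(prev, cur)]
        fused.extend(list(f) for f in seg[overlap:])
    return fused


if __name__ == "__main__":
    a = [[0.0]] * 4
    b = [[1.0]] * 4
    print(len(fuse_segments([a, b], overlap=2)))  # 4 + 4 - 2 frames
    print(amplified_loss([1.0, 1.0], [0.9, 0.1], threshold=0.5, scale=3.0))
```

Because each segment has a fixed length, memory use depends on the segment size rather than the total video length, which is the property the summary attributes to the fusion strategy.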

Do as I Do: Pose Guided Human Motion Copy

Motion Prompting: Controlling Video Generation with Motion Trajectories

Copy Motion from One to Another: Fake Motion Video Generation

Pose Guided Human Video Generation

AnimateAnything: Consistent and Controllable Animation for Video Generation

AMG: Avatar Motion Guided Video Generation

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

ViMo: Generating Motions from Casual Videos

Motion Control for Enhanced Complex Action Video Generation

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Human Motion Transfer from Poses in the Wild

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Motion Mamba: Efficient and Long Sequence Motion Generation

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Pose-Guided Fine-Grained Sign Language Video Generation