Abstract: Despite the remarkable progress of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system that requires only a one-minute video clip of an individual for training and then enables the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrarily long videos, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and we propose a simple yet effective batch-overlapped temporal denoising module to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: \url{

What problem does this paper attempt to address?

The paper addresses the problem of automatically generating high-quality 2D digital anchor videos with vivid expressions, torso movements, and gestures. It proposes a diffusion-model-based 2D digital human generation framework named "Make-Your-Anchor," which automatically generates anchor-style videos with precise torso and hand movements from a personal video of approximately one minute.

The main contributions of the paper are as follows:

1. **A customized 2D digital human generation system**: The "Make-Your-Anchor" system can generate practical digital anchor videos with vivid lip movements, expressions, gestures, and body movements.
2. **A frame-level motion-to-appearance diffusion method**: By binding motion and appearance through a two-stage training strategy and a batch-overlapped temporal denoising scheme, it generates consistent human videos over a long duration.
3. **An identity-specific face enhancement module based on inpainting**: It improves the visual quality of the facial region in the output video.

The method consists of the following components:

- **Structure-Guided Diffusion Model (SGDM)**: Maps 3D human mesh sequences to real human videos.
- **Two-stage training strategy**: The pre-training stage enhances the model's motion generation capability; the fine-tuning stage binds a specific individual's motions to that individual's appearance.
- **Batch-overlapped temporal denoising**: Allows generating temporally coherent videos of any length without additional training cost.
- **Identity-specific face enhancement module**: Improves the quality of facial details through inpainting-based techniques.

The experimental section verifies the effectiveness of the proposed method. Compared to existing GAN baseline methods and other diffusion models, it achieves better performance in terms of visual quality, temporal coherence, and identity preservation. Additionally, the paper conducts detailed ablation experiments to verify the effectiveness of each component.
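
The summary does not spell out how batch-overlapped temporal denoising works, so the following is only a minimal sketch of one common way to realize the idea, not the authors' implementation. It assumes that frames are split into fixed-size windows that share some frames with their neighbors, that each window is denoised independently, and that predictions on shared frames are averaged. The names `window_starts`, `overlapped_denoise_step`, and `denoise_fn`, as well as the default window and overlap sizes, are hypothetical.

```python
import numpy as np


def window_starts(num_frames, window, overlap):
    """Start indices of windows that cover frames [0, num_frames)."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Add a final window flush with the end so no frame is left out.
        starts.append(num_frames - window)
    return starts


def overlapped_denoise_step(latents, denoise_fn, window=8, overlap=4):
    """Run one denoising step over a long sequence, window by window.

    latents:    array of shape (num_frames, ...), one latent per frame.
    denoise_fn: hypothetical stand-in for the video model; it takes a
                short clip of latents and returns an array of the same shape.
    Frames covered by several windows receive the average prediction.
    """
    num_frames = latents.shape[0]
    out = np.zeros_like(latents, dtype=np.float64)
    counts = np.zeros(num_frames, dtype=np.float64)
    for start in window_starts(num_frames, window, overlap):
        end = min(start + window, num_frames)
        out[start:end] += denoise_fn(latents[start:end])
        counts[start:end] += 1.0
    # Broadcast the per-frame counts over the remaining latent dimensions.
    return out / counts.reshape(-1, *([1] * (latents.ndim - 1)))
```

In this sketch the model only ever sees `window` frames at a time, so memory use stays constant regardless of video length, while the averaging on shared frames is what ties neighboring windows together. A full sampler would repeat this step at every diffusion timestep.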

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

DreaMoving: A Human Video Generation Framework based on Diffusion Models

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text

AMG: Avatar Motion Guided Video Generation

ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation

Anchored Diffusion for Video Face Reenactment

DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer

Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

Real-time Expressive Avatar Animation Generation Based on Monocular Videos

HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting