Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Ziyao Huang, Fan Tang, Yong Zhang, Xiaodong Cun, Juan Cao, Jintao Li, Tong-Yee Lee
2024-03-25
Abstract: Despite the remarkable progress of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system necessitating only a one-minute video clip of an individual for training, subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrarily long videos, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: \url{
Computer Science
What problem does this paper attempt to address?
The paper aims to address the problem of automatically generating high-quality 2D digital anchor videos with vivid expressions, torso movements, and gestures. Specifically, the study proposes a diffusion model-based 2D digital human generation framework named "Make-Your-Anchor," which can automatically generate anchor-style videos with precise torso and hand movements from a personal video of approximately 1 minute.

To achieve this goal, the main contributions of the paper are as follows:

1. **Proposed a customized 2D digital human generation system**: The "Make-Your-Anchor" system can generate practical and applicable digital anchor videos with vivid lip movements, expressions, gestures, and body movements.
2. **Proposed a frame-level motion-to-appearance diffusion method**: By binding motion and appearance through a two-stage training strategy and a batch overlapping temporal denoising scheme, it generates consistent human videos over a long duration.
3. **Introduced an identity-specific face enhancement module based on inpainting**: To improve the visual quality of the facial region in the output video.

The method of the paper mainly includes the following aspects:

- **Structure-Guided Diffusion Model (SGDM)**: Used to map 3D human mesh sequences to real human videos.
- **Two-stage training strategy**: The pre-training stage enhances the model's motion generation capability; the fine-tuning stage enables the model to bind specific individual actions and appearances.
- **Batch overlapping temporal denoising**: Allows generating temporally coherent videos of any length without additional training costs.
- **Identity-specific face enhancement module**: Significantly improves the quality of facial details through inpainting-based techniques.

The experimental section verifies the effectiveness of the proposed method. Compared to existing GAN baseline methods and other diffusion models, it achieves better performance in terms of visual quality, temporal coherence, and identity preservation. Additionally, the paper conducts detailed ablation experiments to verify the effectiveness of each component.
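
The summary above does not spell out how batch overlapping temporal denoising works, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes that, at each denoising step, a long frame sequence is split into fixed-size windows that share some frames with their neighbours, each window is denoised independently, and frames covered by more than one window are averaged. The names `overlapped_windows`, `denoise_step_overlapped`, and `denoise_batch` are hypothetical, and per-frame latents are represented as plain floats to keep the example self-contained.

```python
from typing import Callable, List, Sequence


def overlapped_windows(num_frames: int, window: int, overlap: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into windows sharing `overlap` frames."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    windows: List[range] = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append(range(start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def denoise_step_overlapped(
    latents: Sequence[float],
    denoise_batch: Callable[[List[float]], List[float]],
    window: int,
    overlap: int,
) -> List[float]:
    """Run one denoising step over a long sequence.

    Each window is passed to `denoise_batch` (a hypothetical stand-in for the
    video diffusion model); frames that fall in several windows are averaged.
    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    totals = [0.0] * len(latents)
    counts = [0] * len(latents)
    for idx in overlapped_windows(len(latents), window, overlap):
        out = denoise_batch([latents[i] for i in idx])
        for i, value in zip(idx, out):
            totals[i] += value
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

In a full sampler this step would be repeated once per diffusion timestep, so the window size bounds memory use while the sequence length stays unconstrained. Because neighbouring windows share frames, their predictions are blended at the boundaries, which is one plausible way to encourage temporal coherence across batches.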