ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

Haijie Yang,Zhenyu Zhang,Hao Tang,Jianjun Qian,Jian Yang

2024-11-23

Abstract: Diffusion models have shown impressive potential for talking head generation. While plausible appearance and talking effects are achieved, these methods still suffer from temporal, 3D, or expression inconsistency due to error accumulation and the inherent limitations of single-image generation. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly applying multi-modal conditions to the diffusion process, our method first learns to model a temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align the TSD of the initial result to that of the ground-truth video frame. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, a rough head normal, and an emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate a temporally stable talking head. Further, its reliable guidance compensates for the inaccuracy of the other conditions, suppressing accumulated error while improving consistency in various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms state-of-the-art methods in generated appearance, 3D, expression, and temporal consistency. Project page: https://njust-yang.github.io/ConsistentAvatar.github.io/

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to achieve temporal, 3D, and expression consistency simultaneously when generating talking head avatars. Existing methods can produce avatars with realistic appearance and speaking effects, but they often fall short on one or more of these three kinds of consistency because of error accumulation and the inherent limitations of single-image generation. The paper therefore proposes the ConsistentAvatar framework, which aims to generate fully consistent, high-quality talking head avatars by combining diffusion models with Temporally-Sensitive Detail (TSD), 3D-aware conditions, and emotion conditions.

The main contributions of the paper are:

1. Proposing ConsistentAvatar, a diffusion-model-based neural renderer for generating talking head avatars that are consistent in time, 3D, and expression.
2. Learning to align a novel Temporally-Sensitive Detail (TSD) map to keep generated frames stable, supplemented by rough normal and emotion conditions to achieve high-fidelity generation.
3. Showing through extensive experiments that ConsistentAvatar outperforms existing state-of-the-art methods in generated appearance quality, details, expressions, and temporal consistency.

In this way, the paper targets the inconsistency problems of existing talking head avatar methods and offers a more realistic and controllable option for character creation in virtual reality and related applications.
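To make the idea of "high-frequency detail that varies along the time axis" concrete, the following is a minimal sketch and not the paper's TSD extraction, which this page does not specify. It assumes a simple high-pass residual (each pixel minus its local mean) as a stand-in for a detail map, and compares that residual across two grayscale frames. The names `box_blur`, `high_frequency_map`, and `temporal_variation` are hypothetical and chosen only for illustration.

```python
from typing import List

Image = List[List[float]]  # grayscale frame as rows of pixel intensities


def box_blur(img: Image, radius: int = 1) -> Image:
    """Mean filter; the window is truncated at the image borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def high_frequency_map(img: Image, radius: int = 1) -> Image:
    """Assumed stand-in for a detail map: pixel minus its local mean."""
    low = box_blur(img, radius)
    h, w = len(img), len(img[0])
    return [[img[y][x] - low[y][x] for x in range(w)] for y in range(h)]


def temporal_variation(frame_a: Image, frame_b: Image, radius: int = 1) -> float:
    """Mean absolute difference between the detail maps of two frames."""
    ha = high_frequency_map(frame_a, radius)
    hb = high_frequency_map(frame_b, radius)
    h, w = len(ha), len(ha[0])
    return sum(abs(ha[y][x] - hb[y][x]) for y in range(h) for x in range(w)) / (h * w)


if __name__ == "__main__":
    flat = [[0.5] * 4 for _ in range(4)]               # no detail at all
    edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]    # one vertical contour
    print(temporal_variation(flat, flat))  # identical frames
    print(temporal_variation(flat, edge))  # a contour appears between frames
```

A flat frame yields an all-zero detail map, while a frame with a contour yields nonzero values near the edge, so the variation score rises when contours appear or move between adjacent frames. A learned alignment module, as described in the abstract, would operate on such detail maps rather than on a scalar score.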

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

ExpAvatar: High-Fidelity Avatar Generation of Unseen Expressions with 3D Face Priors

TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model

Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures

TimeWalker: Personalized Neural Space for Lifelong Head Avatars

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models

AniFaceDiff: Animating Stylized Avatars via Parametric Conditioned Diffusion Models

GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image

DynamicAvatars: Accurate Dynamic Facial Avatars Reconstruction and Precise Editing with Diffusion Models

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

StyleAvatar3D: Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation

AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion