Abstract: Talking head generation is a significant research topic that still faces numerous challenges. Previous works often adopt generative adversarial networks or regression models, which suffer from limited generation quality and the average-facial-shape problem. Although diffusion models show impressive generative ability, their application to talking head generation remains unsatisfactory. This is because existing methods either use the diffusion model solely to obtain an intermediate representation and then employ another pre-trained renderer, or they overlook the feature decoupling of complex facial details, such as expressions, head poses, and appearance textures. Therefore, we propose a Facial Decoupled Diffusion model for Talking head generation called FD2Talk, which fully leverages the advantages of diffusion models and decouples the complex facial details across multiple stages. Specifically, we separate facial details into motion and appearance. In the first phase, we design a Diffusion Transformer to accurately predict motion coefficients from raw audio. These motions are highly decoupled from appearance, making them easier for the network to learn than high-dimensional RGB images. In the second phase, we encode the reference image to capture appearance textures. The predicted facial and head motions and the encoded appearance then serve as the conditions for a Diffusion UNet, guiding the frame generation. Because it decouples facial details and fully leverages diffusion models, our approach improves image quality and generates more accurate and diverse results than previous state-of-the-art methods, as extensive experiments substantiate.
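
The two-phase data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name, dimension, and step count below is a hypothetical choice made for illustration, and the learned Diffusion Transformer and Diffusion UNet are replaced by a trivial placeholder denoiser that simply pulls a noisy vector toward a target. The sketch only shows which inputs condition which stage: audio alone conditions the motion stage, while motion plus encoded appearance condition the frame stage.

```python
import random
from typing import Callable, List

Vector = List[float]

MOTION_DIM = 4  # hypothetical size of a motion-coefficient vector
FRAME_DIM = 8   # hypothetical size of a flattened toy "frame"
NUM_STEPS = 10  # hypothetical number of reverse-diffusion steps


def reverse_diffusion(dim: int, denoise: Callable[[Vector, int], Vector],
                      rng: random.Random) -> Vector:
    """Start from Gaussian noise and apply the denoiser NUM_STEPS times."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(NUM_STEPS)):
        x = denoise(x, t)
    return x


def toy_denoiser(target: Vector) -> Callable[[Vector, int], Vector]:
    """Placeholder for a learned network: moves x halfway toward target."""
    def step(x: Vector, t: int) -> Vector:
        return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return step


def stage1_motion(audio_windows: List[Vector], rng: random.Random) -> List[Vector]:
    """Phase 1: one motion vector per audio window, conditioned on audio only."""
    motions = []
    for window in audio_windows:
        # Toy conditioning: the target motion is the mean of the audio window.
        target = [sum(window) / len(window)] * MOTION_DIM
        motions.append(reverse_diffusion(MOTION_DIM, toy_denoiser(target), rng))
    return motions


def encode_appearance(reference_image: Vector) -> Vector:
    """Placeholder appearance encoder: an identity copy of the reference."""
    return list(reference_image)


def stage2_frames(motions: List[Vector], appearance: Vector,
                  rng: random.Random) -> List[Vector]:
    """Phase 2: one frame per motion vector, conditioned on motion and appearance."""
    frames = []
    for motion in motions:
        # Toy conditioning: shift the appearance by the mean motion value.
        offset = sum(motion) / len(motion)
        target = [a + offset for a in appearance]
        frames.append(reverse_diffusion(FRAME_DIM, toy_denoiser(target), rng))
    return frames


def generate(audio_windows: List[Vector], reference_image: Vector,
             seed: int = 0) -> List[Vector]:
    """Run both phases: audio -> motion, then (motion, appearance) -> frames."""
    rng = random.Random(seed)
    motions = stage1_motion(audio_windows, rng)
    appearance = encode_appearance(reference_image)
    return stage2_frames(motions, appearance, rng)
```

Calling `generate(audio_windows, reference_image)` with a reference vector of length `FRAME_DIM` returns one toy frame per audio window. Note that appearance never enters the first phase, which mirrors the decoupling of motion from appearance that the abstract describes.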

Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Audio-driven Talking Face Video Generation with Natural Head Pose

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Free-HeadGAN: Neural Talking Head Synthesis with Explicit Gaze Control

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

Realistic Speech-Driven Facial Animation with GANs

High-Fidelity and Freely Controllable Talking Head Video Generation

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Realistic talking face animation with speech-induced head motion

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Face Animation with an Attribute-Guided Diffusion Model