Abstract: Talking face generation is the process of synthesizing a lip-synchronized video from a reference portrait and an audio clip. Generating a fine-grained talking video is nontrivial because of three challenges: 1) capturing vivid facial expressions, such as muscle movements; 2) ensuring smooth transitions between consecutive frames; and 3) preserving the details of the reference portrait. Existing efforts have focused only on modeling rigid lip movements, resulting in low-fidelity videos with jerky facial muscle deformations. To address these challenges, we propose a novel Fine-gRained mOtioN moDel (FROND), consisting of three components. In the first component, we adopt a two-stream encoder to capture local facial movement keypoints and embed their overall motion context as the global code. In the second component, we design a motion estimation module to predict audio-driven movements. This module learns local keypoint motion in a continuous trajectory space, which yields temporally smooth facial movements. Additionally, the local and global motions are fused to estimate a continuous dense motion field, resulting in spatially smooth movements. In the third component, we devise a novel implicit image decoder based on an implicit neural network. This decoder recovers high-frequency information from the input image, resulting in a high-fidelity talking face. In summary, FROND refines the motion trajectories of facial keypoints into a continuous dense motion field, which is followed by a decoder that fully exploits the inherent smoothness of the motion. We conduct quantitative and qualitative model evaluations on benchmark datasets. The experimental results show that our proposed FROND significantly outperforms several state-of-the-art baselines.
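
The abstract does not specify how local and global motions are fused, so the following is only an illustrative sketch of the general idea of turning sparse keypoint displacements into a dense motion field. It assumes Gaussian-weighted blending of per-keypoint displacements plus a single global offset standing in for the global code; the function name `dense_motion_field`, its parameters, and the `sigma` bandwidth are hypothetical and are not taken from FROND.

```python
import math

def dense_motion_field(keypoints, displacements, global_shift,
                       height, width, sigma=0.2):
    """Blend sparse keypoint displacements into a per-pixel motion field.

    Assumed (hypothetical) conventions, not FROND's actual design:
      keypoints     -- list of (x, y) in normalized [0, 1] coordinates
      displacements -- list of (dx, dy), one per keypoint (local motion)
      global_shift  -- (dx, dy) added at every pixel (stand-in for global code)
    Returns a height x width grid of (dx, dy) tuples.
    """
    field = []
    for row in range(height):
        y = row / (height - 1) if height > 1 else 0.0
        line = []
        for col in range(width):
            x = col / (width - 1) if width > 1 else 0.0
            # Gaussian influence of each keypoint at this pixel.
            weights = [
                math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
                for kx, ky in keypoints
            ]
            # Epsilon avoids division by zero; far from every keypoint the
            # local contribution fades toward zero.
            total = sum(weights) + 1e-8
            dx = sum(w * d[0] for w, d in zip(weights, displacements)) / total
            dy = sum(w * d[1] for w, d in zip(weights, displacements)) / total
            line.append((dx + global_shift[0], dy + global_shift[1]))
        field.append(line)
    return field

# Usage: one keypoint at the image center moving right, plus a global
# downward shift, sampled on a 3 x 3 grid.
field = dense_motion_field([(0.5, 0.5)], [(1.0, 0.0)], (0.0, 0.5), 3, 3)
```

Because the weights vary smoothly with pixel position, neighboring pixels receive similar displacements, which is one simple way to obtain the spatial smoothness the abstract describes.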

Generating Smooth and Facial-Details-Enhanced Talking Head Video: A Perspective of Pre and Post Processes

Audio-driven Talking Face Video Generation with Natural Head Pose

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

High-Fidelity and Freely Controllable Talking Head Video Generation

Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions

Talking Faces: Audio-to-Video Face Generation

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition

Toward Fine-Grained Talking Face Generation

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Spatially and Temporally Optimized Audio-Driven Talking Face Generation

GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Continuously Controllable Facial Expression Editing in Talking Face Videos

Audio-Driven Emotional 3D Talking-Head Generation

Talking face generation driven by time-frequency domain features of speech audio

LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation