Abstract: Talking face generation is the demanding task of synthesizing a high-quality video with accurate lip synchronization and rhythmic head motion. However, existing methods often produce unrealistic facial animations because 1) they take only single-modal input and ignore the complementarity of multimodal inputs for improving lip-sync; 2) they model only lip movements and ignore the articulator synergy between lips and jaw; and 3) they generate each video frame independently and ignore the temporal continuity across the video. To address these limitations, we present a novel method that generates realistic and temporally coherent talking heads by considering multimodal inputs, articulator synergy, inter-frame consistency, and intra-frame consistency. First, for landmark prediction, we propose a Multiple Synergy Network (MSN) that improves prediction accuracy by incorporating multimodal inputs (i.e., audio and text). In addition, rather than considering lip landmarks alone, we also model jaw movements to ensure articulator synergy between lips and jaw. Second, for realistic video generation, we propose a Video Consistency Network (VCN) conditioned on the predicted landmarks. In VCN, optical flow models the temporal continuity between frames to ensure inter-frame consistency. Meanwhile, a mouth generation branch enhances mouth texture, and the corresponding mouth mask ensures intra-frame consistency between the mouth area and the rest of the face. Extensive experiments demonstrate that our approach achieves superior lip-sync and generates photo-realistic facial animations. The project is available at http://imcc.ustc.edu.cn/project/tfgen/.
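As a rough illustration of the two consistency notions named above, the following NumPy sketch shows one plausible reading: a mouth mask used to composite a generated mouth region into a frame (intra-frame), and a dense flow field used to warp the previous frame for comparison with the current one (inter-frame). This is not the VCN implementation. The function names, the nearest-neighbour warping, and the L1 penalty are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def blend_with_mouth_mask(face, mouth, mask):
    """Composite a generated mouth region into a face frame (hypothetical helper).

    face, mouth: float arrays of shape (H, W, 3)
    mask: float array of shape (H, W), 1 inside the mouth area, 0 elsewhere
    """
    m = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return m * mouth + (1.0 - m) * face


def warp_with_flow(prev_frame, flow):
    """Backward-warp prev_frame with a dense flow field (nearest-neighbour).

    Assumed convention: flow[y, x] = (dx, dy) points from pixel (x, y) in the
    current frame to its source location in prev_frame.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Round to the nearest source pixel and clamp to the image border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_frame[src_y, src_x]


def temporal_l1(curr_frame, prev_frame, flow):
    """Mean absolute difference between curr_frame and the warped prev_frame."""
    warped = warp_with_flow(prev_frame, flow)
    return float(np.mean(np.abs(curr_frame - warped)))
```

A learned model would typically use a differentiable bilinear warp and a soft mask. The nearest-neighbour version is used here only to keep the sketch short and dependency-free.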

TalkingFlow: Talking Facial Landmark Generation with Multi-Scale Normalizing Flow Network

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Audio-driven Talking Face Video Generation with Natural Head Pose

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance

FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis

Generating Talking Face Landmarks from Speech

Flow-Based Unconstrained Lip to Speech Generation

FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization

CPNet: Exploiting CLIP-based Attention Condenser and Probability Map Guidance for High-fidelity Talking Face Generation

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Multimodal Inputs Driven Talking Face Generation With Spatial–Temporal Dependency

SeamsTalk: Seamless Talking Face Generation via Flow-Guided Inpainting

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

TellMeTalk: Multimodal-driven talking face video generation

UniFLG: Unified Facial Landmark Generator from Text or Speech

Spatially and Temporally Optimized Audio‐Driven Talking Face Generation

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person