Abstract: Talking face generation is the demanding task of synthesizing a high-quality video with accurate lip synchronization and rhythmic head motion. However, existing methods often suffer from unrealistic facial animations, because 1) they take only single-modal input and ignore the complementarity of multimodal inputs for improving lip-sync; 2) they model only lip movements and ignore the articulator synergy between the lips and jaw; 3) they generate each video frame in a temporally independent way and ignore temporal continuity across the entire video. To address these limitations, we present a novel method that generates realistic and temporally coherent talking heads by considering multimodal inputs, articulator synergy, inter-frame consistency and intra-frame consistency. First, for landmark prediction, we propose a novel Multiple Synergy Network (MSN) that improves prediction accuracy by incorporating multimodal inputs (i.e., audio and text). In addition, instead of considering lip landmarks alone, we also model jaw movements to ensure articulator synergy between the lips and jaw. Second, for realistic video generation, we propose a Video Consistency Network (VCN) conditioned on the predicted landmarks. In VCN, optical flow is adopted to model the temporal continuity between frames and thereby ensure inter-frame consistency. Meanwhile, a mouth generation branch is proposed to enhance mouth texture, and the corresponding mouth mask is employed to ensure intra-frame consistency between the mouth area and the rest of the frame. Extensive experiments demonstrate that our approach achieves superior lip-sync and can generate photo-realistic facial animations. The project is available at http://imcc.ustc.edu.cn/project/tfgen/.
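
The abstract does not say how the mouth mask is applied, so the following is only an illustrative sketch of one common choice, per-pixel alpha compositing, and not the paper's implementation. It assumes grayscale frames stored as nested lists of intensities in [0, 1]; the function name `composite_with_mask` and the data layout are hypothetical.

```python
from typing import List

# Assumption: a grayscale image is a list of rows of intensities in [0, 1].
Image = List[List[float]]


def composite_with_mask(face: Image, mouth: Image, mask: Image) -> Image:
    """Blend a mouth image into a face image using a soft mask.

    Each output pixel is mask * mouth + (1 - mask) * face, so a mask value
    of 1 keeps the mouth-branch pixel and a value of 0 keeps the face pixel.
    This is generic alpha compositing, used here as a stand-in for whatever
    mask-based fusion the paper actually uses.
    """
    if not (len(face) == len(mouth) == len(mask)):
        raise ValueError("all inputs must have the same height")

    blended: Image = []
    for face_row, mouth_row, mask_row in zip(face, mouth, mask):
        if not (len(face_row) == len(mouth_row) == len(mask_row)):
            raise ValueError("all inputs must have the same width")
        blended.append(
            [m * mo + (1.0 - m) * f
             for f, mo, m in zip(face_row, mouth_row, mask_row)]
        )
    return blended


if __name__ == "__main__":
    # Toy 2x2 example: the bottom row is (partly) covered by the mouth mask.
    face = [[0.2, 0.2], [0.2, 0.2]]
    mouth = [[0.8, 0.8], [0.8, 0.8]]
    mask = [[0.0, 0.0], [0.5, 1.0]]
    for row in composite_with_mask(face, mouth, mask):
        print(row)
```

A soft mask (values strictly between 0 and 1 near the boundary) avoids a visible seam between the generated mouth region and the surrounding pixels, which is the kind of intra-frame consistency the abstract refers to.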

Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Generating Talking Face Landmarks from Speech

Talking Face Generation With Audio-Deduced Emotional Landmarks

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Audio-driven Talking Face Video Generation with Natural Head Pose

TalkingFlow: Talking Facial Landmark Generation with Multi-Scale Normalizing Flow Network

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Multimodal Inputs Driven Talking Face Generation With Spatial–Temporal Dependency

Spatially and Temporally Optimized Audio‐Driven Talking Face Generation

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis

SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Talking Faces: Audio-to-Video Face Generation

UniFLG: Unified Facial Landmark Generator from Text or Speech

Talking face generation driven by time-frequency domain features of speech audio

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation