Abstract: Audio-driven one-shot talking face generation methods are usually trained on videos of many different speakers. However, the videos they create often suffer from unnatural mouth shapes and asynchronous lips, because those methods struggle to learn a consistent speech style from different speakers. We observe that it would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker and then transferring audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions, represented by keypoint-based dense motion fields, from an input audio. In particular, considering that audio may come from different identities in deployment, we incorporate phonemes to represent audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker, and thus allows us to manipulate face images of different identities readily. Considering that different face shapes lead to different motions, a motion field transfer module is exploited to reduce the audio-driven dense motion field gap between the training identity and the one-shot reference. Once we have obtained the dense motion field of the reference image, we employ an image renderer to generate its talking face videos from an audio clip. Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state-of-the-art in terms of visual quality and lip-sync.
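
As an illustration of the phoneme-based audio representation mentioned in the abstract, the sketch below turns a time-aligned phoneme sequence into one label per video frame. The abstract does not say how phonemes are extracted or aligned, so the alignment format (start, end, label), the 25 fps frame rate, the "sil" silence label, and the function name phonemes_per_frame are all assumptions made for this example, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical alignment entry: (start_seconds, end_seconds, phoneme_label).
Alignment = List[Tuple[float, float, str]]


def phonemes_per_frame(
    alignment: Alignment,
    num_frames: int,
    fps: float = 25.0,      # assumed video frame rate
    silence: str = "sil",   # assumed label for frames with no phoneme
) -> List[str]:
    """Assign each video frame the phoneme active at the frame's centre time."""
    labels = []
    for i in range(num_frames):
        t = (i + 0.5) / fps  # centre of frame i, in seconds
        label = silence
        for start, end, phoneme in alignment:
            if start <= t < end:
                label = phoneme
                break
        labels.append(label)
    return labels


if __name__ == "__main__":
    # Toy alignment; a real one would come from a forced aligner.
    toy = [(0.0, 0.08, "HH"), (0.08, 0.20, "AH")]
    print(phonemes_per_frame(toy, num_frames=6))
```

Because the per-frame labels depend only on which phoneme is spoken and not on the speaker's voice, a representation of this kind is one plausible reason a model trained on a single speaker could accept audio from other identities, as the abstract claims.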

SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement

Audio-driven Talking Face Video Generation with Natural Head Pose

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

Talking Faces: Audio-to-Video Face Generation

One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning

MakeItTalk: Speaker-Aware Talking-Head Animation

Realistic talking face animation with speech-induced head motion

Predicting Personalized Head Movement From Short Video and Speech Signal

Spatially and Temporally Optimized Audio‐Driven Talking Face Generation

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Superior and Pragmatic Talking Face Generation with Teacher-Student Framework

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Multimodal Inputs Driven Talking Face Generation With Spatial–Temporal Dependency

SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing