Abstract: Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. Personalized TFG is a variant that emphasizes the perceptual identity similarity of the synthesized result, in terms of both appearance and talking style. Previous works typically solve this problem by learning an individual neural radiance field (NeRF) for each identity to implicitly store its static and dynamic information. We find this approach inefficient and poorly generalizable because of its per-identity-per-training framework and limited training data. To this end, we propose MimicTalk, the first attempt to exploit the rich knowledge of a NeRF-based person-agnostic generic model to improve the efficiency and robustness of personalized TFG. Specifically, (1) we first develop a person-agnostic 3D TFG model as the base model and propose to adapt it to a specific identity; (2) we propose a static-dynamic-hybrid adaptation pipeline to help the model learn the personalized static appearance and facial dynamic features; (3) to generate the facial motion of the personalized talking style, we propose an in-context stylized audio-to-motion model that mimics the implicit talking style provided in the reference video, avoiding the information loss caused by an explicit style representation. The adaptation process to an unseen identity can be performed in 15 minutes, which is 47 times faster than previous person-dependent methods. Experiments show that MimicTalk surpasses previous baselines in video quality, efficiency, and expressiveness. Source code and video samples are available at https://mimictalk.github.io.

Manitalk: manipulable talking head generation from single image in the wild

Audio-driven Talking Face Video Generation with Natural Head Pose

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

MakeItTalk: Speaker-Aware Talking-Head Animation

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Talking Face Generation With Audio-Deduced Emotional Landmarks

CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

TellMeTalk: Multimodal-driven talking face video generation

High-Fidelity and Freely Controllable Talking Head Video Generation

Talking Faces: Audio-to-Video Face Generation