Abstract: Recent years have witnessed great progress in audio-driven talking head animation. Among existing methods, 3D-based ones better preserve the 3D consistency of the generated head and produce more natural results than 2D-based approaches. However, most 3D-based methods use 3D morphable face models as an intermediate representation and involve multi-stage training, which may lead to error accumulation. To alleviate this problem, we propose a fully end-to-end talking head animation method that implicitly captures 3D structure by learning a conditional Neural Radiance Field (NeRF). Since NeRF has proven to be an effective tool for 3D modeling, one can learn dynamic neural radiance fields conditioned on audio signals for talking head synthesis. Furthermore, we argue that audio signals alone cannot fully drive a lifelike talking head. When people talk, they usually show many spontaneous facial movements, such as blinks and brow movements, which make the speaker look natural and real. These movements cannot be fully driven by audio because they are largely uncorrelated with it. Therefore, we incorporate motion information as a second driving factor and develop an audio-motion dual-driven NeRF model as a step toward more lifelike talking head synthesis. On this basis, since audio and motion mainly affect different regions of the human face, we propose a Spatially-adaptive Dual-driven NeRF (SD-NeRF), which fuses these two driving factors with a spatially-adaptive cross-attention mechanism. Quantitative and qualitative results demonstrate that, with finer facial control, our method produces more realistic talking head videos than existing advanced methods. For more video results, including multi-view animation and cross audio-driven results, please refer to our demonstration video: https://cloud.tsinghua.edu.cn/f/7ebd663951e5403da4a5/
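
The idea of fusing two driving signals differently at different face regions can be illustrated with a toy sketch. The code below is not the authors' architecture; it only assumes that a 3D sample point acts as the attention query and that one audio feature and one motion feature act as the two tokens being attended to. All names (`fuse_conditions`, `w_query`, `w_key`), the feature sizes, and the random weights are hypothetical placeholders; in a trained model the projections would be learned.

```python
import math
import random


def positional_encoding(point, num_freqs=4):
    """NeRF-style sinusoidal encoding of a 3D point (3 * 2 * num_freqs values)."""
    feats = []
    for x in point:
        for k in range(num_freqs):
            feats.append(math.sin((2 ** k) * math.pi * x))
            feats.append(math.cos((2 ** k) * math.pi * x))
    return feats


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_conditions(point, audio_feat, motion_feat, w_query, w_key):
    """Cross-attention with one query (the point) over two tokens (audio, motion).

    Returns the fused condition vector and the two attention weights.
    """
    query = matvec(w_query, positional_encoding(point))
    keys = [matvec(w_key, audio_feat), matvec(w_key, motion_feat)]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    fused = [weights[0] * a + weights[1] * m
             for a, m in zip(audio_feat, motion_feat)]
    return fused, weights


# Hypothetical sizes and random (untrained) projections, for illustration only.
rng = random.Random(0)
d_attn, feat_dim, pe_dim = 8, 6, 24
w_query = [[rng.gauss(0, 1) for _ in range(pe_dim)] for _ in range(d_attn)]
w_key = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(d_attn)]
audio_feat = [rng.gauss(0, 1) for _ in range(feat_dim)]
motion_feat = [rng.gauss(0, 1) for _ in range(feat_dim)]

fused, weights = fuse_conditions((0.1, -0.3, 0.5), audio_feat, motion_feat,
                                 w_query, w_key)
print("audio weight:", weights[0], "motion weight:", weights[1])
```

Because the query depends on the sample point, the two weights can differ from one location to another, which is the sense in which such a fusion is spatially adaptive. The fused vector would then condition the radiance field alongside the point and view direction.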

Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head

AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

Audio-Driven Emotional 3D Talking-Head Generation

S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis

SD-NeRF: Towards Lifelike Talking Head Animation Via Spatially-Adaptive Dual-Driven NeRFs

Audio-driven Talking Face Video Generation with Natural Head Pose

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

Real-time Neural Radiance Talking Portrait Synthesis Via Audio-spatial Decomposition

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks

ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis

AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

NeRF-AD: Neural Radiance Field with Attention-based Disentanglement for Talking Face Synthesis

DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering

Instruct-NeuralTalker: Editing Audio-Driven Talking Radiance Fields with Instructions

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis

Talking Face Generation With Audio-Deduced Emotional Landmarks

Embedded Representation Learning Network for Animating Styled Video Portrait

Audio-Driven Emotional Video Portraits