Abstract: Recent years have witnessed great progress in audio-driven talking head animation. Among these methods, 3D-based ones better preserve the 3D consistency of the generated head and produce more natural results than 2D-based approaches. However, most 3D-based methods use 3D morphable face models as an intermediate representation and involve multi-stage training, which may lead to error accumulation. To alleviate this problem, we propose a fully end-to-end talking head animation method that implicitly captures 3D structure by learning a conditional Neural Radiance Field (NeRF). Since NeRF has proven to be an effective tool for 3D modeling, one can learn dynamic neural radiance fields conditioned on audio signals for talking head synthesis. Furthermore, we argue that audio signals alone cannot fully drive a lifelike talking head. When people talk, they usually show many spontaneous facial movements, such as blinks and brow movements, which make the speaker look natural and real. These movements cannot be fully driven by the audio signal because they are largely uncorrelated with it. Therefore, we incorporate motion information as a second driving factor and develop an audio-motion dual-driven NeRF model as a step toward more lifelike talking head synthesis. On this basis, since audio and motion mainly affect different regions of the human face, we propose a Spatially-adaptive Dual-driven NeRF (SD-NeRF), which fuses these two driving factors with a spatially-adaptive cross-attention mechanism. Quantitative and qualitative results demonstrate that, with finer facial control, our method produces more realistic talking head videos than existing advanced works. For more video results, including multi-view animation and cross-audio-driven results, please refer to our demonstration video: https://cloud.tsinghua.edu.cn/f/7ebd663951e5403da4a5/ .
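
The spatially-adaptive fusion of audio and motion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the function `fuse_conditions`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are hypothetical. The sketch assumes that each sampled 3D point acts as a query, that the audio and motion features act as two key/value tokens, and that the resulting per-point attention weights decide how strongly each driving factor conditions that point.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector (length A) by an A x B matrix; returns a list of length B."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_conditions(points, audio_feat, motion_feat, w_q, w_k, w_v):
    """Hypothetical per-point cross-attention over two condition tokens.

    points:      list of N 3D positions (queries)
    audio_feat:  audio feature vector of length D (assumed to come from some encoder)
    motion_feat: motion feature vector of length D (likewise assumed)
    w_q: 3 x H, w_k: D x H, w_v: D x H projection matrices (assumed, untrained here)
    Returns (fused, weights): N fused vectors of length H and N [audio, motion] weight pairs.
    """
    tokens = [audio_feat, motion_feat]
    keys = [matvec(t, w_k) for t in tokens]
    values = [matvec(t, w_v) for t in tokens]
    scale = math.sqrt(len(keys[0]))

    fused, weights = [], []
    for p in points:
        q = matvec(p, w_q)
        # Scaled dot-product score of this point against each condition token.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)  # position-dependent mix of audio vs. motion
        weights.append(w)
        fused.append([w[0] * va + w[1] * vm for va, vm in zip(values[0], values[1])])
    return fused, weights


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D, H = 8, 4  # illustrative feature and hidden sizes
points = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
audio_feat = [rng.gauss(0.0, 1.0) for _ in range(D)]
motion_feat = [rng.gauss(0.0, 1.0) for _ in range(D)]
w_q, w_k, w_v = random_matrix(3, H, rng), random_matrix(D, H, rng), random_matrix(D, H, rng)

fused, weights = fuse_conditions(points, audio_feat, motion_feat, w_q, w_k, w_v)
```

In a dual-driven NeRF of the kind the abstract describes, a fused vector like this would presumably be passed, together with the point position and view direction, to the network that predicts density and color. Because the weights depend on position, points near the mouth could in principle learn to attend mostly to audio, while points near the eyes and brows attend mostly to motion.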

Sem-Avatar: Semantic Controlled Neural Field for High-Fidelity Audio Driven Avatar

Audio-driven Talking Face Video Generation with Natural Head Pose

READ Avatars: Realistic Emotion-controllable Audio Driven Avatars

AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

Video-driven state-aware facial animation

Real-time Conversion from a Single 2D Face Image to a 3D Text-Driven Emotive Audio-Visual Avatar

Audio-driven facial animation by joint end-to-end learning of pose and emotion

SD-NeRF: Towards Lifelike Talking Head Animation Via Spatially-Adaptive Dual-Driven NeRFs

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field

S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field

AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars

Universal Facial Encoding of Codec Avatars from VR Headsets

EmoFace: Audio-driven Emotional 3D Face Animation

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

High-fidelity facial and speech animation for VR HMDs

BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis