Abstract: In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies treat facial animation as a single regression problem, which often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlooks their inherent consistency. Moreover, because 3D audio-visual datasets are scarce, approaches trained on small samples generalize poorly, which degrades performance. To address these issues, we propose a cross-modal dual-learning framework, termed DualTalker, that aims to improve data usage efficiency and to model cross-modal dependencies. The framework jointly trains the primary task (audio-driven facial animation) and its dual task (lip reading), which share common audio/motion encoder components. Joint training uses data more efficiently by leveraging information from both tasks and by explicitly exploiting the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Our code and video demonstrations are available at https://github.com/sabrina-su/iadf.git.
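
The joint objective described in the abstract can be pictured as a weighted sum of three terms: a primary animation loss, a dual lip-reading loss, and a cross-modal consistency term. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the loss forms, so the squared-error animation loss, the classification-style lip-reading loss, the embedding-distance consistency term, the function names, and the weights `w_dual` and `w_consistency` are all hypothetical.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])


def joint_loss(
    pred_motion: Sequence[float],
    true_motion: Sequence[float],
    token_probs: Sequence[float],
    token_label: int,
    audio_embedding: Sequence[float],
    motion_embedding: Sequence[float],
    w_dual: float = 1.0,          # hypothetical weight, not from the paper
    w_consistency: float = 0.1,   # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of primary, dual, and consistency terms."""
    # Primary task: audio-driven facial animation (assumed squared error).
    primary = mse(pred_motion, true_motion)
    # Dual task: lip reading, modeled here as token classification (assumption).
    dual = cross_entropy(token_probs, token_label)
    # Consistency: pull the two modality embeddings together (assumption).
    consistency = mse(audio_embedding, motion_embedding)
    return primary + w_dual * dual + w_consistency * consistency
```

In this toy form, both task losses would backpropagate through the shared encoders, which is one way a dual task could supply extra supervision when 3D audio-visual data is scarce.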

Joint gaze-correction and beautification of DIBR-synthesized human face via dual sparse coding

Joint Gaze Correction and Face Beautification for Conference Video using Dual Sparsity Prior

Coupled Dictionary Learning for the Detail-Enhanced Synthesis of 3-D Facial Expressions

Video-driven state-aware facial animation

Chunk-wise Face Model Based Gaze Correction in Conversational Videos with Single Camera

An experimental facial synthesis system using graph cut and gradient domain fusion

Image-based facial sketch-to-photo synthesis via online coupled dictionary learning

Face beautification: Beyond makeup transfer

Eye gaze correction with stereovision for video-teleconferencing

Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

A Data-Driven Approach for Facial Expression Retargeting in Video

IA-FaceS: A bidirectional method for semantic face editing

Robust Geometry and Reflectance Disentanglement for 3D Face Reconstruction from Sparse-view Images

Facial Depth Map Enhancement Via Neighbor Embedding

Image based Face Sketch-to-Photo via Online Coupled Dictionary Learning

Dual In-painting Model for Unsupervised Gaze Correction and Animation in the Wild

Joint Sketch-Attribute Learning for Fine-Grained Face Synthesis

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

Frontal face synthesis based on improved binocular stereo vision

A data-driven approach for facial expression synthesis in video

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation