Abstract: In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies treat facial animation as a single regression problem, which often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlooks their inherent consistency. Moreover, because 3D audio-visual datasets are scarce, approaches trained on small samples generalize poorly, which degrades performance. To address these issues, we propose a cross-modal dual-learning framework, termed DualTalker, that aims to improve data usage efficiency and to model cross-modal dependencies. The framework jointly trains the primary task (audio-driven facial animation) and its dual task (lip reading), which share common audio/motion encoder components. This joint training uses data more efficiently by leveraging information from both tasks and by explicitly exploiting the complementary relationship between facial motion and audio. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Our code and video demonstrations are available at https://github.com/sabrina-su/iadf.git.
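
To make the joint objective described in the abstract more concrete, the following is a minimal sketch of how a primary-task loss, a dual-task loss, and a cross-modal consistency term might be combined. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss forms, so the mean squared error for motion, the single-token cross-entropy standing in for lip reading, the cosine-distance consistency term, and the weights `w_dual` and `w_cons` are all hypothetical choices.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def joint_loss(pred_motion, true_motion, text_probs, true_token,
               audio_emb, motion_emb, w_dual=1.0, w_cons=0.1):
    """Hypothetical joint objective for dual learning.

    primary:     audio -> facial motion, scored with MSE (assumed form)
    dual:        motion -> text (lip reading), simplified here to one
                 token classification scored with cross-entropy (assumed form)
    consistency: pulls the audio and motion embeddings together via
                 cosine distance (assumed form)
    """
    primary = mse(pred_motion, true_motion)
    dual = cross_entropy(text_probs, true_token)
    consistency = cosine_distance(audio_emb, motion_emb)
    return primary + w_dual * dual + w_cons * consistency


if __name__ == "__main__":
    # Toy values only; they do not come from VOCA or BIWI.
    loss = joint_loss(
        pred_motion=[0.1, 0.2, 0.3],
        true_motion=[0.1, 0.25, 0.2],
        text_probs=[0.7, 0.2, 0.1],
        true_token=0,
        audio_emb=[1.0, 0.0],
        motion_emb=[0.8, 0.6],
    )
    print(f"joint loss: {loss:.4f}")
```

In this sketch the three terms are simply summed with fixed weights; a real system would compute them on batched tensors and backpropagate through the shared encoders, which is where the data-efficiency benefit described in the abstract would come from.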

Building Digital Human

ViDA-MAN: Visual Dialog with Digital Humans

An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation

Digital Human Intelligent Interaction System Based on Multimodal Pre-training Mode

Text to Avatar in Multi-modal Human Computer Interface

Generation of virtual digital human for customer service industry

Body of Her: A Preliminary Study on End-to-End Humanoid Agent

Human-Computer Interaction System: A Survey of Talking-Head Generation

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Digital Ventriloquism: Giving Voice to Everyday Objects

Digital Life Project: Autonomous 3D Characters with Social Intelligence

From Talking Head To Singing Head: A Significant Enhancement For More Natural Human Computer Interaction

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

Digital Avatars: Framework Development and Their Evaluation

Real-time Ultrasound-enhanced Multimodal Imaging of Tongue using 3D Printable Stabilizer System: A Deep Learning Approach

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation

SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape