Abstract: In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies treat facial animation as a single regression problem, which often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlooks their inherent consistency. Moreover, because 3D audio-visual datasets are scarce, approaches trained on small samples generalize poorly, which degrades performance. To address these issues, we propose a cross-modal dual-learning framework, termed DualTalker, that aims to improve data usage efficiency and to relate cross-modal dependencies. The framework is trained jointly on the primary task (audio-driven facial animation) and its dual task (lip reading), and the two tasks share common audio/motion encoder components. This joint training makes more efficient use of the data by leveraging information from both tasks and by explicitly exploiting the complementary relationship between facial motion and audio. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Our code and video demonstrations are available at https://github.com/sabrina-su/iadf.git.
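
The abstract describes joint training on a primary task, a dual task, and an auxiliary cross-modal consistency term, but it does not give the loss formulation. The following is only a minimal sketch of how such a combined objective could be assembled, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the weights `w_dual` and `w_cons`, the use of mean squared error for facial motion, per-frame cross-entropy for lip reading, and a cosine-distance form for the consistency term.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error, used here as a stand-in for the animation loss."""
    return float(np.mean((pred - target) ** 2))

def cross_entropy(logits, target_ids):
    """Mean cross-entropy over frames, a stand-in for the lip-reading loss."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    rows = np.arange(len(target_ids))
    return float(-np.mean(log_probs[rows, target_ids]))

def cosine_consistency(audio_feat, motion_feat, eps=1e-8):
    """Hypothetical consistency term: 1 - mean per-frame cosine similarity."""
    num = np.sum(audio_feat * motion_feat, axis=-1)
    den = (np.linalg.norm(audio_feat, axis=-1)
           * np.linalg.norm(motion_feat, axis=-1) + eps)
    return float(np.mean(1.0 - num / den))

def joint_loss(pred_motion, true_motion, text_logits, text_ids,
               audio_feat, motion_feat, w_dual=1.0, w_cons=0.1):
    """Weighted sum of primary, dual, and consistency terms (assumed weights)."""
    primary = mse(pred_motion, true_motion)         # audio -> facial motion
    dual = cross_entropy(text_logits, text_ids)     # facial motion -> text
    cons = cosine_consistency(audio_feat, motion_feat)
    return primary + w_dual * dual + w_cons * cons
```

In this sketch, both task losses would be backpropagated through the shared encoders, which is one way the dual task could act as extra supervision when 3D data is scarce.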

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

Dyadic Interaction Modeling for Social Behavior Generation

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Active Listener: Continuous Generation of Listener's Head Motion Response in Dyadic Interactions

Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

Can Language Models Learn to Listen?

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation

CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation

Audio-driven facial animation by joint end-to-end learning of pose and emotion

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model

Learn2Talk: 3D Talking Face Learns from 2D Talking Face

Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Audio2Gestures: Generating Diverse Gestures From Audio

End-to-end Learning for 3D Facial Animation from Raw Waveforms of Speech

Predicting Personalized Head Movement From Short Video and Speech Signal

Video-audio Driven Real-Time Facial Animation