Abstract: In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies treat facial animation as a single regression problem, an approach that often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlooks their inherent consistency. Moreover, because 3D audio-visual datasets are scarce, approaches trained on small samples generalize poorly, which degrades performance. To address these issues, we propose a cross-modal dual-learning framework, termed DualTalker, that aims to improve data usage efficiency and to model cross-modal dependencies. The framework jointly trains the primary task (audio-driven facial animation) and its dual task (lip reading), and the two tasks share common audio/motion encoder components. This joint training uses data more efficiently by leveraging information from both tasks, and it explicitly exploits the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Our code and video demonstrations are available at https://github.com/sabrina-su/iadf.git.
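
The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of any loss term, so everything below is an assumption made for illustration: mean squared error for the primary animation task, cross-entropy for the dual lip-reading task, cosine distance between audio and motion embeddings as a stand-in for the cross-modal consistency loss, and the hypothetical weights `w_dual` and `w_cons`. It is not the authors' implementation.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error; assumed form of the primary (audio -> motion) loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the true token; assumed form of the dual (lip reading) loss."""
    return -math.log(probs[label])


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; an illustrative stand-in for a cross-modal consistency term."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def joint_loss(
    pred_motion: Sequence[float],
    true_motion: Sequence[float],
    token_probs: Sequence[float],
    true_token: int,
    audio_emb: Sequence[float],
    motion_emb: Sequence[float],
    w_dual: float = 1.0,   # hypothetical weight on the dual task
    w_cons: float = 0.1,   # hypothetical weight on the consistency term
) -> float:
    """Weighted sum of primary, dual, and consistency losses (all forms assumed)."""
    primary = mse(pred_motion, true_motion)
    dual = cross_entropy(token_probs, true_token)
    consistency = cosine_distance(audio_emb, motion_emb)
    return primary + w_dual * dual + w_cons * consistency
```

In a real system the embeddings would come from the shared audio and motion encoders, so gradients from both tasks would update the same encoder parameters; here plain lists of floats stand in for those outputs.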

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation
