Abstract: In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies treat facial animation as a single regression problem, which often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlooks their inherent consistency. Moreover, because 3D audio-visual datasets are scarce, approaches trained on small samples generalize poorly, which degrades performance. To address these issues, we propose a cross-modal dual-learning framework, termed DualTalker, that aims to improve data usage efficiency and to relate cross-modal dependencies. The framework is trained jointly on the primary task (audio-driven facial animation) and its dual task (lip reading), and the two tasks share common audio/motion encoder components. This joint training makes more efficient use of the data by leveraging information from both tasks, and it explicitly exploits the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. We have made our code and video demonstrations available at https://github.com/sabrina-su/iadf.git.

Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Diverse and Effective Synthetic Data Generation for Adaptable Zero-Shot Dialogue State Tracking

Adapting Text-based Dialogue State Tracker for Spoken Dialogues

Hybrid Dialogue State Tracking for Real World Human-to-Human Dialogues

OLISIA: a Cascade System for Spoken Dialogue State Tracking

Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

KILDST: Effective Knowledge-Integrated Learning for Dialogue State Tracking using Gazetteer and Speaker Information

Enhancing Dialogue State Tracking Models through LLM-backed User-Agents Simulation

SynthDST: Synthetic Data is All You Need for Few-Shot Dialog State Tracking

DSTEA: Improving Dialogue State Tracking via Entity Adaptive Pre-training

Non-Autoregressive Dialog State Tracking

Using Deep-Q Network To Select Candidates From N-Best Speech Recognition Hypotheses For Enhancing Dialogue State Tracking

DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications

Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking

Evolvable Dialogue State Tracking for Statistical Dialogue Management

Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model