MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh

2024-06-20

Abstract: Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulation. However, lip-sync accuracy degrades when these models are applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task: generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. With our proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Code and datasets are available at https://multi-talk.github.io/.

Computer Vision and Pattern Recognition, Graphics

What problem does this paper attempt to address?

This paper addresses the problem of generating 3D talking heads from speech in multiple languages. Existing speech-driven 3D talking head methods have made progress, especially in lip synchronization, but their performance declines when the input speech is not in English, potentially because no facial movement dataset covers a diverse range of languages. To address this, the paper introduces a new task, generating 3D talking heads from speech in different languages, and constructs a multilingual video dataset (MultiTalk) comprising over 420 hours of 2D talking videos in 20 languages. Using this dataset, the authors train a multilingually enhanced model that uses language-specific style embeddings to capture the mouth movements characteristic of each language. The paper also introduces a new metric, Audio-Visual Lip Readability (AVLR), for evaluating lip-sync accuracy in multilingual settings; it relies on a pre-trained audio-visual speech recognition model to assess the generated 3D talking heads on multilingual speech. Experimental results show that the model trained on the MultiTalk dataset outperforms previous works on multilingual 3D talking head generation. The main contributions are the task of multilingual 3D talking head generation, the MultiTalk dataset, and a baseline model, also named MultiTalk, that generates accurate and expressive 3D facial movements from multilingual speech.
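To make the idea of language-specific style embeddings more concrete, here is a minimal sketch. It assumes the simplest possible form of conditioning: one vector per language, appended to every per-frame audio feature. The names `LANGUAGES`, `EMBED_DIM`, `build_style_table`, and `condition_on_language` are hypothetical and not taken from the paper or its code, and the vectors are random stand-ins for values that a real model would learn during training. How the actual MultiTalk model injects its embeddings is not described in this summary.

```python
import random

# All names below are illustrative assumptions, not the authors' API.
LANGUAGES = ["english", "korean", "french"]  # the real dataset covers 20 languages
EMBED_DIM = 4                                # arbitrary size chosen for the sketch


def build_style_table(languages, dim, seed=0):
    """Create one style vector per language.

    In a trained model these vectors would be learned parameters;
    here they are random placeholders.
    """
    rng = random.Random(seed)
    return {lang: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for lang in languages}


def condition_on_language(audio_features, language, style_table):
    """Append the language's style vector to every per-frame audio feature."""
    style = style_table[language]
    return [frame + style for frame in audio_features]


style_table = build_style_table(LANGUAGES, EMBED_DIM)
audio_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames, 2 dims each
conditioned = condition_on_language(audio_features, "korean", style_table)
```

Each conditioned frame now carries both the audio content and a language tag in vector form, so a downstream decoder could, in principle, produce different mouth shapes for the same sound depending on the language.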

PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features

Audio-driven Talking Face Video Generation with Natural Head Pose

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

MMHead: Towards Fine-grained Multi-modal 3D Facial Animation

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Audio-Driven 3D Facial Animation from In-the-Wild Videos

Towards Realistic Visual Dubbing with Heterogeneous Sources

LaughTalk: Expressive 3D Talking Head Generation with Laughter

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation