MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh
2024-06-20
Abstract: Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulations. However, generating accurate lip-syncs degrades when applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task to generate 3D talking heads from speeches of diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. With our proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Codes and datasets are available at https://multi-talk.github.io/.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper addresses the problem of generating 3D talking heads from speech in multiple languages. Existing work on speech-driven 3D talking head generation has made progress, especially in lip synchronization. However, lip-sync accuracy declines when the input speech is in a language other than English, possibly because no dataset covers a broad range of facial movements across languages. To address this, the paper introduces a new task, generating 3D talking heads from speech in diverse languages, and builds a multilingual video dataset (MultiTalk) comprising over 420 hours of 2D talking videos in 20 languages. Using this dataset, the authors train a multilingually enhanced model that incorporates language-specific style embeddings to capture the mouth movements characteristic of each language. The paper also introduces a new metric, Audio-Visual Lip Readability (AVLR), to assess lip-sync accuracy in multilingual settings; it relies on a pre-trained audio-visual speech recognition model to evaluate the generated 3D talking heads on multilingual speech. Experimental results show that a model trained on the MultiTalk dataset outperforms previous work on multilingual 3D talking head generation. The main contributions are the multilingual 3D talking head generation task, the MultiTalk dataset, and a baseline model, also called MultiTalk, that generates accurate and expressive 3D facial movements from multilingual speech.
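As a rough illustration of what conditioning on a language-specific style embedding can look like, the sketch below keeps one vector per language in a lookup table and adds it to every frame of an audio feature sequence. This is a toy under stated assumptions, not the authors' architecture: the function names, the three-language subset, the additive conditioning, and the random initialization are all hypothetical, and in a real model the table would be learned jointly with the rest of the network.

```python
import random

# Hypothetical subset of languages; the dataset described above covers 20.
LANGUAGES = ["English", "Korean", "French"]


def make_style_table(languages, dim, seed=0):
    """Create one small random style vector per language (stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {lang: [rng.gauss(0.0, 0.02) for _ in range(dim)] for lang in languages}


def condition_on_language(audio_features, style_table, language):
    """Add the style vector of the given language to every audio feature frame.

    audio_features: list of frames, each a list of floats of length dim.
    Additive conditioning is an assumption made for this sketch.
    """
    style = style_table[language]
    for frame in audio_features:
        if len(frame) != len(style):
            raise ValueError("frame dimension does not match style dimension")
    return [[a + s for a, s in zip(frame, style)] for frame in audio_features]


if __name__ == "__main__":
    frames = [[0.0] * 4 for _ in range(3)]  # 3 dummy frames, 4 features each
    table = make_style_table(LANGUAGES, dim=4)
    conditioned = condition_on_language(frames, table, "Korean")
    print(len(conditioned), len(conditioned[0]))
```

The same audio frames yield different conditioned features depending on the language label, which is the property a downstream decoder would need in order to produce language-dependent mouth movements.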