JoyHallo: Digital human model for Mandarin

Sheng Shi, Xuyang Cao, Jun Zhao, Guoxin Wang

2024-09-20

Abstract: In audio-driven video generation, creating Mandarin videos presents significant challenges. Comprehensive Mandarin datasets are difficult to collect, and Mandarin's complex lip movements make model training harder than for English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. The dataset covers a diverse range of ages and speaking styles and includes both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. We propose a semi-decoupled structure to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also speeds up inference by 14.3%. Notably, JoyHallo retains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities. The code and models are available at https://jdh-algo.github.io/JoyHallo.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the challenges of generating Mandarin videos in audio-driven video generation. It identifies two main issues:

1. **Lack of high-quality Mandarin datasets**: Comprehensive, high-quality Mandarin datasets are harder to collect than English ones.
2. **Complexity of Mandarin lip movements**: Mandarin lip movements are more complex than those of English, which makes model training more challenging.

To tackle these challenges, the research team collected 29 hours of Mandarin speech video from employees of JD Health International Inc., creating a dataset named jdh-Hallo. They also proposed the JoyHallo model, which uses the Chinese wav2vec2 model for audio feature embedding and introduces a semi-decoupled structure that captures relationships among lip, expression, and pose features in order to improve the accuracy of lip movement prediction. According to the authors, this approach improves information utilization and speeds up inference by 14.3%.

The reported experiments indicate that JoyHallo performs well in Mandarin video generation while retaining its ability to generate high-quality English videos, which demonstrates cross-language generation capability. The work thus offers a solution to the Mandarin-specific challenges of audio-driven video generation.
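The 14.3% inference speedup quoted above can be read in two ways, and the summary does not say which one the authors mean. The following is a minimal arithmetic sketch of both readings, not a reproduction of the authors' benchmark. The function name `time_after_speedup` and the 70-second baseline are hypothetical and chosen only for illustration.

```python
def time_after_speedup(baseline_seconds: float, pct: float, reading: str) -> float:
    """Return the new inference time under one reading of 'pct% faster'."""
    frac = pct / 100.0
    if reading == "throughput":
        # Reading 1: throughput (e.g. frames per second) rises by pct,
        # so the time shrinks by a factor of 1 / (1 + frac).
        return baseline_seconds / (1.0 + frac)
    if reading == "latency":
        # Reading 2: the wall-clock time itself falls by pct.
        return baseline_seconds * (1.0 - frac)
    raise ValueError("reading must be 'throughput' or 'latency'")


# Hypothetical baseline: 70 seconds to generate one clip (not from the paper).
baseline = 70.0
throughput_time = time_after_speedup(baseline, 14.3, "throughput")
latency_time = time_after_speedup(baseline, 14.3, "latency")

print(f"throughput reading: {throughput_time:.2f} s")
print(f"latency reading:    {latency_time:.2f} s")
```

Under the throughput reading the time saved is smaller than 14.3% of the baseline, so the two interpretations should not be used interchangeably when comparing against other models.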
