JoyHallo: Digital human model for Mandarin

Sheng Shi, Xuyang Cao, Jun Zhao, Guoxin Wang
2024-09-20
Abstract: In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also increases inference speed by 14.3%. Notably, JoyHallo maintains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities. The code and models are available at https://jdh-algo.github.io/JoyHallo.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty of generating Mandarin videos in audio-driven video generation. It identifies two main issues:

1. **Lack of high-quality Mandarin datasets**: Comprehensive, high-quality Mandarin datasets are harder to collect than English ones.
2. **Complexity of Mandarin lip movements**: Mandarin lip movements are more complex than those of English, which makes model training more challenging.

To tackle these challenges, the research team collected 29 hours of Mandarin speech video from employees of JD Health International Inc., creating a dataset named jdh-Hallo. They also proposed the JoyHallo model, which uses the Chinese wav2vec2 model for audio feature embedding and introduces a semi-decoupled structure that captures relationships among lip, expression, and pose features to improve the accuracy of lip movement prediction. This approach improves information utilization efficiency and increases inference speed by 14.3%.

Experimental results show that JoyHallo performs well in Mandarin video generation while retaining strong cross-language capability: it can still generate high-quality English videos. Overall, the work offers a practical solution to the specific challenges of Mandarin video generation.
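The reported 14.3% inference speed gain can be read in two ways, and the text above does not say which one is meant. The following minimal Python sketch shows how the two readings differ. The baseline time of 70 seconds and the function name `time_after_speedup` are hypothetical, chosen only for illustration; they are not figures or code from the paper.

```python
def time_after_speedup(baseline_seconds: float, gain: float, reading: str) -> float:
    """Return the new inference time under one reading of a reported speed gain.

    reading="time_reduction":      the inference time itself shrinks by `gain`.
    reading="throughput_increase": work done per second grows by `gain`.
    """
    if reading == "time_reduction":
        return baseline_seconds * (1.0 - gain)
    if reading == "throughput_increase":
        return baseline_seconds / (1.0 + gain)
    raise ValueError(f"unknown reading: {reading!r}")


# Hypothetical baseline; the paper's actual timings are not given in this summary.
baseline = 70.0
gain = 0.143  # the 14.3% figure quoted in the abstract

as_time_reduction = time_after_speedup(baseline, gain, "time_reduction")
as_throughput_increase = time_after_speedup(baseline, gain, "throughput_increase")

print(f"time reduction reading:      {as_time_reduction:.2f} s")
print(f"throughput increase reading: {as_throughput_increase:.2f} s")
```

Under either reading the new time is below the baseline, but the time-reduction reading gives the larger saving, so the distinction matters when comparing the figure against other reported speedups.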