Learn2Talk: 3D Talking Face Learns from 2D Talking Face

Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin
2024-04-19
Abstract: Speech-driven facial animation methods usually fall into two main classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking face has not gone as deep as research on 2D talking face in the aspects of lip-synchronization (lip-sync) and speech perception. To mind the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which can construct a better 3D talking face network by exploiting two points of expertise from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motion regression network to yield higher 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-art methods. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
The paper addresses the gap between 3D talking faces and 2D talking faces in the field of speech-driven facial animation. Although 2D talking face methods have made significant progress in lip-sync and speech perception, 3D talking face methods have lagged behind in both. The paper proposes a framework called Learn2Talk, which improves 3D talking face networks by transferring lip-sync and speech perception capabilities from 2D talking face methods.

The main contributions include:

1. **Lip-sync Extension**: SyncNet is extended from the pixel domain to the 3D motion domain. The resulting model, SyncNet3D, serves as a discriminator during training to enhance lip-sync and as a metric during testing to evaluate the quality of synthesized 3D motion.
2. **Speech Perception Enhancement**: Through a lipreading constraint, knowledge from 2D talking face methods is distilled into the audio-to-3D motion regression model, improving the accuracy of the predicted 3D facial motion at the lip vertices.
3. **Performance Surpassing State-of-the-Art**: Quantitative and extensive visual comparisons show that the method surpasses the current state-of-the-art methods on public datasets.
4. **Application Expansion**: The method can drive full-head virtual avatars constructed with 3D Gaussian Splatting (3DGS), which the paper presents as the first speech-driven 3DGS-based virtual avatar animation.

In summary, the paper aims to improve 3D talking face technology by borrowing the strengths of 2D talking face methods, achieving better results in lip-sync, vertex accuracy, and speech perception.
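
To make the training signals described above more concrete, here is a minimal sketch of how a vertex reconstruction term, a SyncNet-style sync term, and a teacher-guided lip term might be combined into one objective. It is an illustration under stated assumptions, not the paper's implementation: the embeddings stand in for the outputs of networks that are not shown, the mapping from cosine similarity to a sync probability is assumed, the distillation term is simplified to a mean squared error between student and teacher lip features, and the names `sync_loss`, `total_loss`, `w_sync`, and `w_lip` (with their default weights) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def sync_loss(audio_emb, motion_emb, eps=1e-7):
    """Negative log of an assumed sync probability.

    Assumption: cosine similarity in [-1, 1] is mapped linearly to [0, 1]
    and treated as the probability that audio and 3D motion are in sync.
    """
    p = (cosine_similarity(audio_emb, motion_emb) + 1.0) / 2.0
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the log finite
    return -math.log(p)


def mse(pred, target):
    """Mean squared error over flattened values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(pred_vertices, gt_vertices,
               audio_emb, motion_emb,
               student_lip_feat, teacher_lip_feat,
               w_sync=0.1, w_lip=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    rec = mse(pred_vertices, gt_vertices)           # vertex accuracy
    sync = sync_loss(audio_emb, motion_emb)         # lip-sync expert signal
    lip = mse(student_lip_feat, teacher_lip_feat)   # teacher guidance
    return rec + w_sync * sync + w_lip * lip


if __name__ == "__main__":
    # Toy values only; real inputs would come from trained networks.
    loss = total_loss([0.1, 0.2], [0.1, 0.25],
                      [1.0, 0.0], [0.9, 0.1],
                      [0.5, 0.5], [0.4, 0.6])
    print(f"toy total loss: {loss:.4f}")
```

In this sketch the sync term shrinks toward zero as the audio and motion embeddings align and grows as they diverge, which is the behavior a sync discriminator of this kind is meant to reward.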