A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese
Changwei Liang, Xiaosheng Pan, Jiangping Kong
DOI: https://doi.org/10.1145/3429889.3429904
2020-01-01
Abstract: This paper presents a new speech-driven lip synchronization method that predicts the 3-D geometric shape of the lips without using a speech recognition model in the visualization procedure, and that can be trained and evaluated against realistic dynamics. Videos of Mandarin Chinese words are used as data. Speech signals are converted into MFCCs as audio features. 68-point facial landmarks are annotated in the corresponding videos using the landmark prediction algorithm from the Dlib library. Eos, a 3-D Morphable Face Model, is applied to the facial landmarks to predict the 3-D shape, from which 3-D landmarks are obtained. A machine-learning sequence-tagging model, an averaged structured perceptron with Viterbi decoding, is used to predict labial parameters directly from the acoustic MFCC parameters. The 3-D labial area shape from the 'eos' prediction of a frame is morphed according to the predicted 3-D labial landmarks, forming the 3-D lip sequence, which can be plotted in synchrony with the acoustic signal. In this 3-D lip synthesis, acoustic features are mapped directly to realistic lip shapes, without lip units or speech recognition, which preserves more realistic articulatory or individual details; comparing the predicted geometric shapes with the realistic dynamics indicates that the synthesis performs well.
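
The decoding step of a sequence-tagging model like the one named in the abstract can be illustrated with a minimal Viterbi sketch in Python. This is an illustration under assumptions, not the authors' implementation: it assumes the labial parameters have been discretized into a small set of labels, and the names `viterbi`, `emission`, and `transition` are hypothetical. In a structured perceptron, the emission scores would come from learned weights applied to each frame's MFCC features; here they are passed in directly.

```python
from typing import List, Sequence


def viterbi(emission: Sequence[Sequence[float]],
            transition: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a run of frames.

    emission[t][k]   : score of label k at frame t (assumed precomputed,
                       e.g. from weights applied to MFCC features)
    transition[j][k] : score of moving from label j to label k
    """
    n_frames = len(emission)
    if n_frames == 0:
        return []
    n_labels = len(emission[0])

    # best[t][k] = best score of any path that ends in label k at frame t
    best = [list(emission[0])]
    # back[t - 1][k] = best previous label for label k at frame t
    back = []

    for t in range(1, n_frames):
        scores, pointers = [], []
        for k in range(n_labels):
            # Pick the predecessor label that maximizes the path score.
            prev = max(range(n_labels),
                       key=lambda j: best[t - 1][j] + transition[j][k])
            scores.append(best[t - 1][prev] + transition[prev][k]
                          + emission[t][k])
            pointers.append(prev)
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final label to recover the full path.
    label = max(range(n_labels), key=lambda k: best[-1][k])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path
```

With two hypothetical labels and a transition matrix that penalizes switching, for example `viterbi([[3, 0], [0, 1], [0, 1]], [[0, -5], [-5, 0]])`, the decoder keeps the first label for all three frames, because the switching cost outweighs the per-frame gain. This is the kind of temporal smoothing a sequence model adds over frame-by-frame prediction.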