A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

Xiaohong Li, Xiang Wang, Kai Wang, Shiguo Lian
DOI: https://doi.org/10.1109/CISP-BMEI53629.2021.9624360
2022-05-02
Abstract: Generating synchronized and natural lip movement with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a combined deep neural network of one-dimensional convolutions and LSTM to generate vertex displacement of a 3D template face model from variable-length speech input. The motion of the lower part of the face, which is represented by the vertex movement of 3D lip shapes, is consistent with the input speech. In order to enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features, and a velocity loss term is adopted to reduce the jitter of generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of such public data. Qualitative and quantitative evaluations indicate that our model is able to generate smooth and natural lip movements synchronized with speech.
Sound, Artificial Intelligence, Computer Vision and Pattern Recognition, Graphics, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is how to generate natural lip movements that are synchronized with speech, in order to create realistic virtual characters. Specifically, the authors propose a deep neural network that combines one-dimensional convolutional layers (CNN) with long short-term memory (LSTM) blocks to generate vertex displacements of a 3D template face model from variable-length speech input. The model generates lower-face motion consistent with the input speech, represented by the vertex movements of 3D lip shapes.

### Main Contributions

1. **Dataset Creation**: Because no Chinese speech-animation dataset was publicly available, the authors recorded a series of videos and created a new Chinese speech-animation dataset.
2. **Feature Extraction**: To improve the model's robustness to different sound signals, the authors use a pre-trained speech recognition model (such as DeepSpeech) to extract speech features.
3. **Network Architecture**: The proposed network combines one-dimensional convolutional layers and LSTM blocks and can generate smooth and natural facial animations.
4. **Loss Function**: A velocity loss term is introduced to reduce jitter in the generated facial animations.

### Specific Methods

- **Data Collection**: Several hours of video of a Chinese adult speaking Mandarin were recorded at 60 frames per second, together with the synchronized audio.
- **3D Model Generation**: The Faceware software tool was used to convert the recorded videos into 3D animations and produce the 3D face data.
- **Network Architecture**: The network consists of two parts:
  - The first half includes two one-dimensional convolutional layers, four unidirectional LSTM blocks and two fully-connected layers, which convert the extracted speech features into low-dimensional embeddings.
  - The second half is a decoder consisting of a fully-connected layer with a linear activation function, which maps the embedding to the high-dimensional 3D vertex displacement space.
- **Loss Function**: The total loss combines a reconstruction loss, which penalizes the gap between the predicted and ground-truth vertex coordinates, and a velocity loss, which reduces vertex jitter.

### Experimental Results

- **Qualitative Evaluation**: Although the training data contains only male voices, the model also generalizes well to female and synthetic voices.
- **Quantitative Evaluation**: Different models were compared on two metrics, position error and velocity error. The results show that the model combining convolutional layers and LSTM blocks achieves better motion accuracy than the model using LSTM alone, and that adding the velocity loss term significantly reduces lip jitter and makes animation transitions smoother.

### Conclusion

The authors address the problem of generating natural, speech-synchronized 3D facial animation by creating a new Chinese speech-animation dataset and proposing a deep neural network that combines convolutional layers and LSTM blocks. The model is reported to be robust across different audio sources. As future work, the realism and naturalness of the facial animation could be further improved by adding motion data for the upper face (such as eyebrows and eyes).
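
### Illustrative Sketch of the Loss

The summary describes the total loss as a reconstruction term plus a velocity term but does not give the formulas. The following is a minimal sketch of one common way to write such a loss, not the authors' implementation. It assumes a mean squared error for both terms, a frame-to-frame difference as the "velocity", and a hypothetical `velocity_weight` parameter; the function names are likewise invented for illustration. It is written in plain Python for readability, whereas a training setup would normally use a tensor library.

```python
from typing import List, Sequence

# One animation frame: flattened 3D vertex coordinates (x0, y0, z0, x1, ...).
Frame = Sequence[float]


def mean_squared_error(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Mean of squared element-wise differences between two frame sequences."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count if count else 0.0


def frame_deltas(frames: Sequence[Frame]) -> List[List[float]]:
    """Finite-difference velocity: each frame minus the previous frame."""
    return [
        [cur - prev for cur, prev in zip(frames[t], frames[t - 1])]
        for t in range(1, len(frames))
    ]


def total_loss(
    pred: Sequence[Frame],
    target: Sequence[Frame],
    velocity_weight: float = 1.0,  # assumed weighting, not from the paper
) -> float:
    """Reconstruction loss plus weighted velocity loss (assumed MSE forms)."""
    reconstruction = mean_squared_error(pred, target)
    velocity = mean_squared_error(frame_deltas(pred), frame_deltas(target))
    return reconstruction + velocity_weight * velocity


if __name__ == "__main__":
    target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    smooth = [[0.1, 0.1], [1.1, 1.1], [2.1, 2.1]]   # constant offset
    jittery = [[0.1, 0.1], [0.9, 0.9], [2.1, 2.1]]  # same offset size, sign flips
    print(round(total_loss(smooth, target), 4))   # 0.01
    print(round(total_loss(jittery, target), 4))  # 0.05
```

In the example, both predictions are off by 0.1 in every coordinate, so their reconstruction terms are equal. Only the velocity term separates them: the prediction whose error flips sign between frames is penalized more, which is the behavior a jitter-reducing term is meant to have.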