Abstract: Generating natural lip movement synchronized with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a deep neural network combining one-dimensional convolutions and LSTM to generate vertex displacements of a 3D template face model from variable-length speech input. The motion of the lower part of the face, which is represented by the vertex movement of 3D lip shapes, is consistent with the input speech. To enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features, and we adopt a velocity loss term to reduce the jitter of the generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of such public data. Qualitative and quantitative evaluations indicate that our model is able to generate smooth and natural lip movements synchronized with speech.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to generate natural lip movements that are synchronized with speech, in order to create realistic virtual characters. Specifically, the authors propose a deep neural network that combines one-dimensional convolutional layers (CNN) with long short-term memory (LSTM) blocks to generate vertex displacements of a 3D template face model from variable-length speech input. With this method, the model generates lower-face movements consistent with the input speech, in particular the vertex movements of the 3D lip shape.

### Main Contributions

1. **Dataset Creation**: Because of the lack of publicly available Chinese speech-animation datasets, the authors recorded a series of videos and created a new Chinese speech-animation dataset.
2. **Feature Extraction**: To improve the model's robustness to different sound signals, the authors use a pre-trained speech recognition model (such as DeepSpeech) to extract speech features.
3. **Network Architecture**: The proposed network combines one-dimensional convolutional layers and LSTM blocks and can generate smooth and natural facial animations.
4. **Loss Function**: A velocity loss term is introduced to reduce jitter in the generated facial animations.

### Specific Methods

- **Data Collection**: Several hours of video of a Chinese adult speaking were recorded with a camera at 60 frames per second, together with the synchronized audio signal.
- **3D Model Generation**: The Faceware software tool was used to convert the recorded videos into 3D animations and to generate the 3D face data.
- **Network Architecture**: The network consists of two parts:
  - The first half includes two one-dimensional convolutional layers, four unidirectional LSTM blocks, and two fully connected layers, which convert the extracted speech features into low-dimensional embeddings.
  - The second half is a decoder consisting of a fully connected layer with a linear activation function, which maps the embedding to the high-dimensional 3D vertex displacement space.
- **Loss Function**: The total loss combines a reconstruction loss, which constrains the gap between the predicted vertex coordinates and the ground-truth values, and a velocity loss, which reduces vertex jitter.

### Experimental Results

- **Qualitative Evaluation**: Although the training data contains only male voices, the model also generalizes well to female and synthetic voices.
- **Quantitative Evaluation**: Different models were compared on two metrics, position error and velocity error. The results show that the model combining convolutional layers and LSTM blocks is more accurate in its motion than the model using LSTM alone, and that the velocity loss term significantly reduces lip jitter and makes animation transitions smoother.

### Conclusion

The authors address the problem of generating natural 3D facial animation synchronized with speech by creating a new Chinese speech-animation dataset and proposing a deep neural network that combines convolutional layers and LSTM blocks. The model is robust across different audio sources. In the future, the realism and naturalness of the facial animation could be improved further by adding movement data for the upper face (such as eyebrows and eyes).
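
The summary describes the total loss only in words: a reconstruction term on vertex positions plus a velocity term that penalizes jitter. The sketch below is one plausible reading of that description, not the authors' implementation. It assumes squared error for both terms, takes velocity to be the difference between consecutive frames, and introduces a hypothetical `velocity_weight` parameter, since the summary does not state how the two terms are balanced. All function names are illustrative.

```python
from typing import List, Sequence

# One animation frame: flattened vertex displacements (x, y, z per vertex).
Frame = Sequence[float]


def mean_squared_error(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Average squared difference over all frames and all coordinates."""
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    # An empty sequence (e.g. velocities of a single frame) contributes nothing.
    return total / count if count else 0.0


def frame_velocities(frames: Sequence[Frame]) -> List[List[float]]:
    """Finite-difference velocity: each frame minus the one before it."""
    return [
        [cur - prev for prev, cur in zip(prev_frame, cur_frame)]
        for prev_frame, cur_frame in zip(frames, frames[1:])
    ]


def reconstruction_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Gap between predicted and ground-truth vertex displacements."""
    return mean_squared_error(pred, target)


def velocity_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Gap between predicted and ground-truth frame-to-frame motion."""
    return mean_squared_error(frame_velocities(pred), frame_velocities(target))


def total_loss(
    pred: Sequence[Frame],
    target: Sequence[Frame],
    velocity_weight: float = 1.0,  # hypothetical; not given in the summary
) -> float:
    """Reconstruction loss plus weighted velocity loss."""
    return reconstruction_loss(pred, target) + velocity_weight * velocity_loss(pred, target)


if __name__ == "__main__":
    target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    jittery = [[0.0, 0.0], [1.5, 1.5], [2.0, 2.0]]  # one frame overshoots
    shifted = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]  # constant offset, same motion
    print(velocity_loss(jittery, target), velocity_loss(shifted, target))
```

Under these assumptions, a prediction that is offset by a constant has a nonzero reconstruction loss but zero velocity loss, while a prediction that overshoots on a single frame is penalized by both terms. That is the intuition behind using the velocity term to suppress frame-to-frame jitter.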

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Audio-driven Talking Face Video Generation with Natural Head Pose

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

VisemeNet: Audio-Driven Animator-Centric Speech Animation

Realistic Speech-Driven Talking Video Generation with Personalized Pose

Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

Real-Time Lip Sync for Live 2D Animation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Real-time Lip Synchronization Based on Hidden Markov Models

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Learning Speech-driven 3D Conversational Gestures from Video

Real-time speech-driven lip synchronization

Speech-driven Facial Animation with Spectral Gathering and Temporal Attention