Abstract: We propose a method that converts a speaker’s speech into a talking video of a target character, with more realistic mouth-shape synchronization, facial expression, and body posture in the synthesized video. The task is challenging because changes in mouth shape and posture are coupled with the semantic content of the audio; models are difficult to train to convergence and behave unstably in complex scenes, and existing speech-driven speaker methods do not solve this problem well. Our method first generates, in real time, a sequence of key points for the speaker’s face and body posture from the audio signal and renders these key points as a series of two-dimensional skeleton images. A video generation network then produces the final realistic speaker video from the obtained key points. To obtain a smoother pose key-point sequence and better performance, we randomly sample audio clips, encode audio content and temporal correlations with a more effective network structure, and iteratively optimize the network outputs using a differential loss and an attitude perception loss. In addition, inserting specified action frames into the synthesized human pose sequence window enriches the action poses of the synthesized speaker and makes the result more realistic and natural. To generate realistic, high-resolution videos with fine pose detail, we insert a local attention mechanism into the key-point network of the generated pose sequence and use spatial weight masks to give higher attention to local details of the character. We evaluate the method with the objective normalized mean error (NME) metric and with subjective user evaluation. Experimental results show that our method vividly generates speaker videos that correspond to the audio content, and that it outperforms previous work in lip-matching accuracy, expression and posture quality, NME, and subjective user ratings.
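The abstract names two quantities without defining them: the NME evaluation index and a differential loss that encourages smooth key-point sequences. The sketch below illustrates one common reading of each and is not the authors' implementation. It assumes key points are 2D, that NME is the mean Euclidean key-point error divided by a scalar normalizer `norm` (the abstract does not say which normalizer is used), and that the differential loss is the mean absolute difference between the frame-to-frame displacements of the prediction and the ground truth. The function names `nme` and `temporal_difference_loss` are hypothetical.

```python
import math


def nme(pred, target, norm):
    """Mean Euclidean key-point error divided by a scalar normalizer.

    pred, target: sequences of frames, each frame a list of (x, y) points.
    norm: assumed normalizer (e.g. a face or body size); not from the paper.
    """
    total, count = 0.0, 0
    for frame_p, frame_t in zip(pred, target):
        for (px, py), (tx, ty) in zip(frame_p, frame_t):
            total += math.hypot(px - tx, py - ty)
            count += 1
    return total / (count * norm)


def temporal_difference_loss(pred, target):
    """Mean absolute gap between predicted and true frame-to-frame motion.

    Assumed form of a "differential" loss: it compares first-order temporal
    differences, so jitter in the prediction is penalized while a constant
    offset is not.
    """
    total, count = 0.0, 0
    for t in range(1, len(pred)):
        for k in range(len(pred[t])):
            for c in range(2):  # x and y coordinates
                d_pred = pred[t][k][c] - pred[t - 1][k][c]
                d_tgt = target[t][k][c] - target[t - 1][k][c]
                total += abs(d_pred - d_tgt)
                count += 1
    return total / count if count else 0.0


# Three frames, two key points each; the prediction is shifted by (3, 4).
target = [[(0.0, 0.0), (1.0, 1.0)] for _ in range(3)]
pred = [[(x + 3.0, y + 4.0) for x, y in frame] for frame in target]
print(nme(pred, target, norm=10.0))
print(temporal_difference_loss(pred, target))
```

Under these assumptions the two measures are complementary: NME captures positional accuracy per frame, while the difference-based term only responds to motion that disagrees with the ground truth, which is why it favors smoother sequences.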

3D Visible Speech Animation Driven by Chinese Prosody Markup Language

3D Visible Speech Animation Driven by Prosody Text

Text-driven Visual Prosody Generation for Embodied Conversational Agents

3D Facial Animation from Chinese Text

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Real-time Synthesis of Chinese Visual Speech and Facial Expressions Using MPEG-4 FAP Features in a Three-Dimensional Avatar

Prosodic Chinese Sign Language Synthesis Driven by Speech

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

A Multimodal Approach of Generating 3D Human-Like Talking Agent

3D Realistic Talking Face Co-Driven by Text and Speech

Learning Audio-Driven Viseme Dynamics for 3D Face Animation

An Emotional Text-Driven 3D Visual Pronunciation System for Mandarin Chinese

Head Movement Synthesis Based on Semantic and Prosodic Features for a Chinese Expressive Avatar

A Synthesis Method of Three-dimensional Facial Animation with Multiple Elements Blending Based on MPEG-4

Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Realistic Speech-Driven Talking Video Generation with Personalized Pose

PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features

Synthesizing 3D Trump: Predicting and Visualizing the Relationship Between Text, Speech, and Articulatory Movements

Animating a Chinese interactive virtual character

Head and Facial Gestures Synthesis Using PAD Model for an Expressive Talking Avatar