Abstract: In this work, we propose a method that transforms a speaker’s speech into a talking video of a target character, improving the realism of mouth-shape synchronization, facial expression, and body posture in the synthesized video. This is a challenging task because changes in mouth shape and posture are coupled with the semantic content of the audio: model training is difficult to converge, and model performance is unstable in complex scenes. Existing speech-driven speaker methods do not solve this problem well. The proposed method first generates, in real time, a sequence of key points for the speaker’s face and body posture from the audio signal and then visualizes these key points as a series of two-dimensional skeleton images. A video generation network then produces the final realistic speaker video from the obtained pose key points. We randomly sample audio clips, encode audio content and temporal correlations using a more effective network structure, and iteratively optimize the network outputs using a differential loss and an attitude perception loss, so as to obtain a smoother pose key-point sequence and better performance. In addition, by inserting a specified action frame into the synthesized human pose sequence window, we enrich the action poses of the synthesized speaker, making the result more realistic and natural. To generate realistic, high-resolution videos with fine pose detail, we insert a local attention mechanism into the key point network of the generated pose sequence and give higher attention to the local details of the characters through spatial weight masks. To verify the effectiveness of the proposed method, we used both an objective evaluation index, NME, and subjective user evaluation.
Experimental results show that our method can vividly generate speaker videos that correspond to the audio content, and that its lip-matching accuracy and its expressions and postures are better than those of previous work. Compared with existing methods, our method achieved better results in both the NME index and the subjective user evaluation.
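
As an illustration of the objective metric, NME is commonly read as the normalized mean error between predicted and ground-truth key points. The sketch below is one minimal way to compute it, not the authors' evaluation code. The function name `nme`, the array shapes, and the per-frame normalization factor `norm` (for example a face or body bounding-box size) are assumptions; the abstract does not say which normalization was used.

```python
import numpy as np


def nme(pred, target, norm):
    """Normalized mean error over a key-point sequence (illustrative sketch).

    pred, target: arrays of shape (frames, keypoints, 2) with 2D coordinates.
    norm: array of shape (frames,) holding a per-frame normalization length.
          Which length to use is an assumption left to the caller.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    norm = np.asarray(norm, dtype=float)

    # Euclidean distance between each predicted and ground-truth key point.
    per_point = np.linalg.norm(pred - target, axis=-1)  # (frames, keypoints)

    # Average over key points, then normalize each frame by its reference length.
    per_frame = per_point.mean(axis=-1) / norm  # (frames,)

    # Average over frames to get a single score; lower is better.
    return float(per_frame.mean())


if __name__ == "__main__":
    # Toy data: one frame, two key points, normalization length 5.
    pred = np.zeros((1, 2, 2))
    target = np.array([[[3.0, 4.0], [0.0, 0.0]]])
    print(nme(pred, target, norm=[5.0]))
```

A lower NME means the predicted key points lie closer to the ground truth relative to the chosen reference length, so scores are only comparable between methods that use the same normalization.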

StyleLipSync: Style-based Personalized Lip-sync Video Generation

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Style-Preserving Lip Sync via Audio-Aware Style Reference

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Audio-driven Talking Face Video Generation with Natural Head Pose

Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Realistic Speech-Driven Talking Video Generation with Personalized Pose

MILG: Realistic Lip-Sync Video Generation with Audio-Modulated Image Inpainting

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning

Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

Say Anything with Any Style

PVP: Personalized Video Prior for Editable Dynamic Portraits using StyleGAN

One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2

SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model