Abstract: Synthesizing realistic videos from a given speech signal is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that the input speech mainly drives motion and appearance in only a limited facial area (e.g., the lip area). Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when a short video is used for training. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., the canonical space), we present a speech-driven implicit model for lip image generation that concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste lip images generated in the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip.
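
The abstract names a contrastive sync loss but does not give its form. As a rough illustration of the general idea only, the sketch below uses an InfoNCE-style objective, which is an assumption and not necessarily the loss used in Speech2Lip. The function names `cosine` and `contrastive_sync_loss`, the `temperature` value, and the toy embeddings are all hypothetical. Each audio embedding is pulled toward the lip embedding of the same time step and pushed away from lip embeddings of other time steps.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def contrastive_sync_loss(audio_embs, lip_embs, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's definition).

    audio_embs[i] and lip_embs[i] are treated as a matching (in-sync) pair;
    every lip_embs[j] with j != i acts as a negative for audio_embs[i].
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio_embs[i], lip_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the in-sync pair.
        total += log_denom - logits[i]
    return total / n

# Toy embeddings (hypothetical): two time steps, two dimensions.
audio = [[1.0, 0.0], [0.0, 1.0]]
synced = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [synced[1], synced[0]]  # lips misaligned with the audio

loss_synced = contrastive_sync_loss(audio, synced)
loss_shuffled = contrastive_sync_loss(audio, shuffled)
print(loss_synced < loss_shuffled)
```

In this toy setup the aligned pairing yields a lower loss than the shuffled one, which is the behavior a synchronization objective of this kind is meant to reward.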

Lip Synchronization Model For Sinhala Language Using Machine Learning

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

VisemeNet: Audio-Driven Animator-Centric Speech Animation

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Real-Time Lip Sync for Live 2D Animation

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Learning Audio-Driven Viseme Dynamics for 3D Face Animation

A Novel Lip Synchronization Approach for Games and Virtual Environments

Lip syncing method for realistic expressive 3D face model

LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

Audio-driven Talking Face Video Generation with Natural Head Pose

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Learning Speaker-specific Lip-to-Speech Generation

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild