Abstract: Synthesizing realistic videos from given speech is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that only the motion and appearance of limited facial areas (e.g., the lip area) are mainly driven by the input speech. Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., canonical space), we present a speech-driven implicit model for lip image generation which concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste lip images generated in the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained on a video of just a few minutes in length and achieves state-of-the-art performance in both visual quality and speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip.

RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Learning Speaker-specific Lip-to-Speech Generation

Towards Accurate Lip-to-Speech Synthesis in-the-Wild

Intelligible Lip-to-Speech Synthesis with Speech Units

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

SVTS: Scalable Video-to-Speech Synthesis

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

FlexLip: A Controllable Text-to-Lip System

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Lip2AudSpec: Speech reconstruction from silent lip movements video

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Lip-to-Speech Synthesis in the Wild with Multi-task Learning

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos