Abstract: This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. In contrast to benchmark approaches that rely only on deep learning, the approach leverages the complementary strengths of deep learning and analytical acoustic modeling (a filtering-based approach). The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first level, a novel deep learning based lip-reading regression model is employed. In the second level, the clean-audio features approximated by lip reading are exploited, using an enhanced, visually-derived Wiener filter (EVWF), to estimate the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed to estimate the clean audio features using only temporal visual features (i.e., lip reading), by considering a range of prior visual frames. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated, which exploits the estimated speech features. The EVWF is compared with conventional spectral subtraction and log-minimum mean-square error methods using both ideal AV mapping and LSTM-driven AV mapping approaches. The potential of the proposed AV speech enhancement framework is evaluated under four different dynamic real-world scenarios (cafe, street junction, public transport, and pedestrian area) at different SNR levels (ranging from low to high SNRs) using the benchmark GRID and CHiME-3 corpora. For objective testing, perceptual evaluation of speech quality (PESQ) is used to evaluate the quality of the restored speech. For subjective testing, the standard mean-opinion-score (MOS) method is used with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvements in terms of both speech quality and speech intelligibility.
Ongoing work is aimed at enhancing the accuracy and generalization capability of the deep learning driven lip-reading model, using contextual integration of AV cues, leading to context-aware, autonomous AV speech enhancement.
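The second level of the framework can be illustrated with a minimal sketch of a textbook Wiener gain computed per filterbank channel. This is not the EVWF formulation itself, which the abstract does not specify. The sketch assumes that speech and noise are uncorrelated, so that the noisy power in each channel is the sum of speech and noise power, and that the clean-speech filterbank power has already been estimated from visual features. The function names `wiener_gain` and `apply_gain`, the `floor` parameter, and the example values are hypothetical.

```python
from typing import List


def wiener_gain(speech_power: List[float],
                noisy_power: List[float],
                floor: float = 1e-3) -> List[float]:
    """Textbook Wiener gain per filterbank channel (illustrative only).

    speech_power: estimated clean-speech power per channel, e.g. the output
                  of a lip-reading regression model (assumed given here).
    noisy_power:  observed noisy-speech power per channel.
    floor:        hypothetical lower bound on the gain, a common practical
                  safeguard against over-suppression.
    """
    gains = []
    for s_hat, x in zip(speech_power, noisy_power):
        s_hat = max(s_hat, 0.0)          # power cannot be negative
        # Assumption: noisy power = speech power + noise power.
        noise = max(x - s_hat, 0.0)
        denom = s_hat + noise
        g = s_hat / denom if denom > 0.0 else 0.0
        gains.append(min(max(g, floor), 1.0))
    return gains


def apply_gain(noisy_power: List[float], gains: List[float]) -> List[float]:
    """Estimate clean power; the gain acts on amplitude, so power scales by g**2."""
    return [g * g * x for g, x in zip(gains, noisy_power)]


if __name__ == "__main__":
    # Made-up values for three filterbank channels.
    est_speech = [1.0, 0.5, 0.0]
    noisy = [2.0, 0.5, 1.0]
    g = wiener_gain(est_speech, noisy)
    print(g)
    print(apply_gain(noisy, g))
```

In this sketch, a channel in which the estimated speech power accounts for all of the noisy power passes unchanged (gain 1), while a channel with no estimated speech is attenuated down to the gain floor.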

Lip Assistant: Visualize Speech For Hearing Impaired People In Multimedia Services

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Real-time speech-driven lip synchronization

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Lip Viseme Analysis of Chinese Shaanxi Xi’an Dialect Visual Speech for Talking Head in Speech Assistant System

Lipper: Synthesizing Thy Speech using Multi-View Lipreading

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

A data-efficient and easy-to-use lip language interface based on wearable motion capture and speech movement reconstruction

Speaker-independent Lips and Tongue Visualization of Vowels

Visual Words for Automatic Lip-Reading

Lip-Reading Driven Deep Learning Approach for Speech Enhancement

Towards Accurate Lip-to-Speech Synthesis in-the-Wild

Electromyogram-Based Lip-Reading via Unobtrusive Dry Electrodes and Machine Learning Methods

SilentTalk: Lip Reading Through Ultrasonic Sensing on Mobile Phones

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning