Abstract: This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The approach leverages the complementary strengths of deep learning and analytical acoustic modeling (a filtering-based approach), in contrast to benchmark approaches that rely only on deep learning. The proposed audio-visual (AV) speech enhancement framework operates at two levels. At the first level, a novel deep learning based lip-reading regression model is employed. At the second level, the clean-audio features approximated by lip reading are exploited by an enhanced, visually-derived Wiener filter (EVWF) to estimate the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed to estimate the clean audio features from temporal visual features alone (i.e., lip reading), by considering a range of prior visual frames. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated, which exploits the estimated speech features. The EVWF is compared with conventional spectral subtraction and log-minimum mean-square error methods using both ideal AV mapping and LSTM driven AV mapping approaches. The potential of the proposed AV speech enhancement framework is evaluated under four different dynamic real-world scenarios (cafe, street junction, public transport, and pedestrian area) at different SNR levels (ranging from low to high SNRs) using the benchmark GRID and CHiME-3 corpora. For objective testing, perceptual evaluation of speech quality is used to evaluate the quality of restored speech. For subjective testing, the standard mean-opinion-score method is used with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvements in terms of both speech quality and speech intelligibility.
Ongoing work is aimed at enhancing the accuracy and generalization capability of the deep learning driven lip-reading model, using contextual integration of AV cues, leading to context-aware, autonomous AV speech enhancement.
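The second level described in the abstract, where estimated clean-audio features drive a Wiener filter, can be illustrated with a generic textbook Wiener-style gain. This is a minimal sketch under stated assumptions, not the paper's EVWF formulation: the function names, the gain floor, the clipping to 1, and the four-channel toy frame are all hypothetical, and the clean-power estimate stands in for what a lip-reading regression model might output.

```python
import math


def wiener_gain(clean_power_est, noisy_power, floor=1e-3, eps=1e-12):
    """Per-channel Wiener-style gain from power estimates.

    Assumption: noise power is taken as (noisy - clean estimate), so the
    classic gain S / (S + N) reduces to S / Y, clipped to [floor, 1].
    """
    if len(clean_power_est) != len(noisy_power):
        raise ValueError("channel counts must match")
    gains = []
    for s_hat, y in zip(clean_power_est, noisy_power):
        g = s_hat / (y + eps)  # eps guards against division by zero
        gains.append(min(1.0, max(floor, g)))
    return gains


def enhance_magnitudes(noisy_power, gains):
    """Apply the gains to the noisy magnitudes (square root of power)."""
    return [g * math.sqrt(y) for g, y in zip(gains, noisy_power)]


# Hypothetical 4-channel filterbank frame (illustrative values only).
noisy_fb = [4.0, 2.0, 1.0, 0.5]
# Stand-in for clean-audio features estimated from visual frames.
clean_fb_est = [1.0, 2.0, 0.25, 1.0]

gains = wiener_gain(clean_fb_est, noisy_fb)
enhanced = enhance_magnitudes(noisy_fb, gains)
```

Channels where the estimated clean power is small relative to the noisy power are attenuated, while channels where the estimate meets or exceeds the noisy power pass through with a gain capped at 1. A real system would compute the gain per frame and map it from the filterbank domain back to the full spectral resolution before resynthesis.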

Silent speech recognition using data augmentation based on a three-dimensional lip model

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

Deep Audio-visual Speech Recognition

An automatic lip reading for short sentences using deep learning nets

End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge

[Development and evaluation of a deep learning algorithm for German word recognition from lip movements]

End-to-end Mispronunciation Detection with Simulated Error Distance

Speech-Section Extraction Using Lip Movement and Voice Information in Japanese

Efficient DNN Model for Word Lip-Reading

3D Convolutional Neural Networks Based Speaker Identification and Authentication

Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Silenttalk: Lip Reading Through Ultrasonic Sensing on Mobile Phones

Automatic Lip reading for decimal digits using ResNet50 Model

Lip2AudSpec: Speech reconstruction from silent lip movements video

Adaptive Semantic-Spatio-Temporal Graph Convolutional Network for Lip Reading

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

Lip Movement Detection Using 3D Convolution and Resnet

Lip Reading using Deep Learning

Application of deep learning in Mandarin Chinese lip-reading recognition

Lip-Reading Driven Deep Learning Approach for Speech Enhancement