Abstract: This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The approach leverages the complementary strengths of deep learning and analytical acoustic modeling (a filtering-based approach), in contrast to benchmark approaches that rely on deep learning alone. The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first level, a novel deep learning based lip-reading regression model is employed. In the second level, the clean-audio features approximated by lip reading are exploited, using an enhanced, visually-derived Wiener filter (EVWF), to estimate the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed to estimate the clean audio features from temporal visual features only (i.e., lip reading), by considering a range of prior visual frames. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated, which exploits the estimated speech features. The EVWF is compared with conventional spectral subtraction and log-minimum mean-square error methods using both ideal AV mapping and LSTM driven AV mapping approaches. The potential of the proposed AV speech enhancement framework is evaluated under four different dynamic real-world scenarios (cafe, street junction, public transport, and pedestrian area) at signal-to-noise ratio (SNR) levels ranging from low to high, using the benchmark GRID and CHiME-3 corpora. For objective testing, perceptual evaluation of speech quality is used to evaluate the quality of restored speech. For subjective testing, the standard mean-opinion-score method is used with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvements in terms of both speech quality and speech intelligibility.
Ongoing work is aimed at enhancing the accuracy and generalization capability of the deep learning driven lip-reading model, using contextual integration of AV cues, leading to context-aware, autonomous AV speech enhancement.
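The second level described in the abstract can be illustrated with a minimal sketch of a filterbank-domain Wiener-style gain. This is not the paper's EVWF formulation, which the abstract does not spell out. It assumes the gain in each filterbank channel is the ratio of the lip-reading estimated clean power to the observed noisy power, clipped to a floor and to 1, and then linearly interpolated to every FFT bin. The names `wiener_gain`, `apply_gain`, `centre_bins`, and `floor` are hypothetical and chosen only for illustration.

```python
import numpy as np


def wiener_gain(clean_fb_est, noisy_fb, floor=1e-3):
    """Per-channel Wiener-style gain (assumed form: clean estimate / noisy power)."""
    clean_fb_est = np.asarray(clean_fb_est, dtype=float)
    noisy_fb = np.asarray(noisy_fb, dtype=float)
    # Guard against division by zero in silent channels.
    gain = clean_fb_est / np.maximum(noisy_fb, 1e-12)
    # Keep the gain in [floor, 1] so the filter only attenuates.
    return np.clip(gain, floor, 1.0)


def apply_gain(noisy_mag, gain_fb, centre_bins):
    """Interpolate channel gains to all FFT bins and scale the noisy magnitudes."""
    noisy_mag = np.asarray(noisy_mag, dtype=float)
    bins = np.arange(noisy_mag.shape[-1])
    # centre_bins holds the (increasing) FFT bin index of each channel centre.
    gain_full = np.interp(bins, centre_bins, gain_fb)
    return gain_full * noisy_mag


# Toy usage: three filterbank channels mapped onto five FFT bins.
gain = wiener_gain([1.0, 2.0, 0.0], [2.0, 2.0, 4.0])
enhanced = apply_gain(np.ones(5), gain, centre_bins=[0, 2, 4])
```

In a full system, `clean_fb_est` would come from the LSTM lip-reading regression model and `noisy_fb` from the filterbank analysis of the noisy audio. Both are passed in here as plain arrays so that the sketch stays self-contained.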

Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation

Lip Reading using Deep Learning

Multi-Grained Spatio-temporal Modeling for Lip-reading

HMM-based Lip Reading with Stingy Residual 3D Convolution

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading

Learning the Relative Dynamic Features for Word-Level Lipreading

SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory

Sub-word Level Lip Reading With Visual Attention

Adaptive Semantic-Spatio-Temporal Graph Convolutional Network for Lip Reading

Lip-Reading Driven Deep Learning Approach for Speech Enhancement

TCS-LipNet: Temporal & Channel & Spatial Attention-Based Lip Reading Network

Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Collaborative Viseme Subword and End-to-end Modeling for Word-level Lip Reading

Enhancing Lip Reading: A Deep Learning Approach with CNN and RNN Integration

Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading