Abstract: Lip reading, the process of interpreting speech by visually observing lip movements, has emerged as an important area of research with applications in communication aids for the hearing impaired, silent speech interfaces, and human-computer interaction. This paper reviews recent advances in lip reading technologies, focusing on the integration of machine learning and computer vision techniques. We explore state-of-the-art methods, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based models, that have significantly improved the accuracy and robustness of lip reading systems. The study highlights the importance of large annotated datasets, such as LRW and the GRID corpus used to train LipNet, which have facilitated the training of deep learning models. We also examine multimodal approaches that combine visual information with audio signals to improve performance, especially in noisy environments. Despite substantial progress, challenges remain in handling speaker variability, low-resolution video, and real-time processing. Future research directions are discussed, emphasizing the need for more diverse datasets, improved model generalization, and testing in real-world applications. The review underscores the potential of advanced lip reading technologies to improve communication accessibility and human-computer interaction.

This paper also presents a method for a vision-based lip reading system that uses a convolutional neural network (CNN) with an attention-based Long Short-Term Memory (LSTM) network. The dataset consists of video clips of speakers pronouncing words and sentences. A pretrained CNN extracts features from preprocessed video frames, and the LSTM then learns temporal characteristics from these features. A softmax layer at the end of the architecture produces the lip reading result. Experiments are performed with two pretrained models. The system achieves 80% accuracy using TensorFlow and ensemble learning.

Keywords: CNN; RNN; LSTM; TensorFlow; lip reading; deep learning
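
The attention and softmax stages described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the attention form, so simple dot-product attention is assumed here, and the names `attention_pool`, `classify`, `query`, and `class_weights`, along with all numeric values, are hypothetical. The hand-written vectors stand in for the per-frame LSTM outputs that the real system would compute from CNN features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by softmax(h_t . query) and sum over time.

    hidden_states: list of T vectors (stand-ins for LSTM outputs).
    query: hypothetical learned vector of the same dimension (assumption).
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


def classify(context, class_weights):
    """Linear layer (one weight row per word class) followed by softmax."""
    logits = [sum(w_i * c_i for w_i, c_i in zip(row, context)) for row in class_weights]
    return softmax(logits)


# Toy input: 3 frames with 2-dimensional hidden states (illustrative values only).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical word classes

context, weights = attention_pool(hidden_states, query)
probs = classify(context, class_weights)
```

Here `weights` shows how much each frame contributes to the pooled representation, and `probs` is the class distribution that the softmax layer would output. In a trained model, `query` and `class_weights` would be learned parameters rather than fixed values.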
