Abstract: Lip reading, the interpretation of speech from the visible movements of the lips, has become an important research area with applications in communication aids for the hearing impaired, silent speech interfaces, and human-computer interaction. This paper reviews recent advances in lip reading technologies, focusing on the integration of machine learning and computer vision techniques. We survey state-of-the-art methods, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based models, which have significantly improved the accuracy and robustness of lip reading systems. The study highlights the importance of large annotated datasets, such as GRID (the corpus used to train LipNet) and LRW, which have made it possible to train deep learning models. We also examine multimodal approaches that combine visual information with audio signals to improve performance, especially in noisy environments. Despite substantial progress, challenges remain in handling speaker variability, low-resolution video, and real-time processing. Future research directions are discussed, emphasizing the need for more diverse datasets, better model generalization, and testing in real-world applications. The review underscores the potential of advanced lip reading technologies to improve communication accessibility and human-computer interaction.

This paper also presents a method for a vision-based lip reading system that uses a convolutional neural network (CNN) with an attention-based Long Short-Term Memory (LSTM) network. The dataset consists of video clips of speakers pronouncing words and sentences. A pretrained CNN extracts features from preprocessed video frames, and the LSTM then learns temporal characteristics from these features. A softmax layer produces the lip reading result. Experiments are performed with two pretrained models. The system achieves 80% accuracy using TensorFlow and ensemble learning.

Keywords: CNN; RNN; LSTM; TensorFlow; lip reading; deep learning
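The classification stage described in the abstract (attention over per-frame LSTM states, a softmax output, and an ensemble of two models) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the attention form or the ensembling rule, so dot-product attention and soft voting (averaging class probabilities) are assumptions made here, and every function name, the `query` vector, and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Collapse per-frame states into one vector with dot-product attention.

    hidden_states: list of T vectors (one per video frame), each of length D.
    query: hypothetical learned vector of length D used to score each frame.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)  # one weight per frame, summing to 1
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights


def classify(context, class_weights):
    """Linear layer followed by softmax: one probability per vocabulary word."""
    logits = [sum(c * w for c, w in zip(context, row)) for row in class_weights]
    return softmax(logits)


def ensemble(prob_a, prob_b):
    """Soft voting (an assumption): average the two models' class probabilities."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


# Made-up LSTM states for a 3-frame clip with 2-dimensional hidden vectors.
states = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
context, weights = attention_pool(states, query=[1.0, 0.0])

# Two hypothetical classifier heads standing in for the two pretrained models.
probs_a = classify(context, [[2.0, 0.0], [0.0, 2.0]])
probs_b = classify(context, [[1.0, 0.5], [0.5, 1.0]])

final = ensemble(probs_a, probs_b)
predicted = max(range(len(final)), key=final.__getitem__)
```

In a real system the per-frame states would come from an LSTM running over CNN features, and the two probability vectors would come from two separately trained networks. The sketch only shows how the attention weights, the softmax output, and the averaged prediction relate to one another.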
