Abstract: Lip reading has received increasing research interest in recent years due to the rapid development of deep learning and its wide range of potential applications. Good performance on the lip reading task depends heavily on how effectively the representation captures lip movement information while resisting the noise introduced by changes in pose, lighting conditions, speaker appearance, and so on. Toward this goal, we propose introducing mutual information constraints at both the local feature level and the global sequence level to strengthen the relation between the features and the speech content. On the one hand, we constrain the features generated at each time step to carry a strong relation with the speech content by imposing a local mutual information maximization constraint (LMIM), which improves the model's ability to discover fine-grained lip movements and the fine-grained differences among words with similar pronunciation, such as "spend" and "spending". On the other hand, we introduce a mutual information maximization constraint at the global sequence level (GMIM), so that the model pays more attention to discriminative key frames related to the speech content and less to the various noises that appear during speaking. By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading. To verify the method, we evaluate it on two large-scale benchmarks. We perform a detailed analysis and comparison on several aspects, including a comparison of LMIM and GMIM with the baseline and a visualization of the learned representation. The results not only demonstrate the effectiveness of the proposed method but also set a new state of the art on both benchmarks.
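The abstract does not say which estimator is used to maximize mutual information, so the following is only a minimal sketch under stated assumptions. It assumes a Jensen-Shannon-style lower bound computed from the scores of a discriminator that rates (feature, speech content) pairs, which is one common way to implement mutual information maximization. It also assumes the local and global terms are simply added with a weight. The names `jsd_mi_lower_bound`, `combined_mi_objective`, and `global_weight` are hypothetical and do not come from the paper.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_mi_lower_bound(pos_scores, neg_scores):
    """Jensen-Shannon-style mutual information bound (assumed estimator).

    pos_scores: discriminator scores for matched (feature, content) pairs.
    neg_scores: discriminator scores for mismatched pairs.
    Larger values mean the features are more informative about the content.
    """
    pos_term = sum(-softplus(-s) for s in pos_scores) / len(pos_scores)
    neg_term = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return pos_term - neg_term


def combined_mi_objective(local_pos, local_neg, global_pos, global_neg,
                          global_weight=1.0):
    """Hypothetical combination of a local (per time step) term and a
    global (per sequence) term; the weighting scheme is an assumption."""
    local_term = jsd_mi_lower_bound(local_pos, local_neg)
    global_term = jsd_mi_lower_bound(global_pos, global_neg)
    return local_term + global_weight * global_term


# Toy scores: one score per time step for the local term,
# one score per sequence for the global term.
local_bound = jsd_mi_lower_bound([2.0, 1.5, 3.0], [-1.0, -2.5, -0.5])
total = combined_mi_objective([2.0, 1.5, 3.0], [-1.0, -2.5, -0.5],
                              [4.0], [-3.0])
print(local_bound, total)
```

In a training setup, the scores would come from a learned discriminator, and the negated bound would be added to the usual classification loss. A discriminator that separates matched from mismatched pairs well pushes the bound toward its maximum of zero, while an uninformative one (all scores zero) gives -2 log 2.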

Three-Dimensional Joint Geometric-Physiologic Feature For Lip-Reading

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

LipPass: Lip Reading-based User Authentication on Smartphones Leveraging Acoustic Signals.

Lip Reading-Based User Authentication Through Acoustic Sensing on Smartphones.

Learning the Relative Dynamic Features for Word-Level Lipreading

3D Convolutional Neural Networks Based Speaker Identification and Authentication.

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Electromyogram-Based Lip-Reading via Unobtrusive Dry Electrodes and Machine Learning Methods.

Lip Recognition Based on 3D Convolutional Neural Network

Decoding lip language using triboelectric sensors with deep learning

A data-efficient and easy-to-use lip language interface based on wearable motion capture and speech movement reconstruction

Deformation Flow Based Two-Stream Network for Lip Reading

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Mutual Information Maximization for Effective Lip Reading

Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

Lip Movement Detection Using 3D Convolution and Resnet

Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Improving Speech Recognition Performance in Noisy Environments by Enhancing Lip Reading Accuracy

TCS-LipNet: Temporal & Channel & Spatial Attention-Based Lip Reading Network

HMM-based Lip Reading with Stingy Residual 3D Convolution