Abstract: Lipreading is the task of decoding the movement of a speaker’s lip region into text. In recent years, lipreading methods based on deep neural networks have attracted widespread attention, and their accuracy has far surpassed that of experienced human lipreaders. However, the visual differences between some phonemes are extremely subtle, which poses a great challenge to lipreading. Most existing lipreading methods do not further process the extracted visual features and therefore suffer from two main problems. First, the extracted features contain a lot of useless information, such as noise caused by differences in speech speed and lip shape. Second, the extracted features are not abstract enough to distinguish phonemes with similar pronunciation. These problems degrade lipreading performance. To extract features from the lip region that are more discriminative and more relevant to the speech content, this paper proposes an end-to-end deep neural network-based lipreading model (LCSNet). The proposed model extracts short-term spatio-temporal features and motion trajectory features from the lip region in video clips. The extracted features are filtered by a channel attention module to eliminate useless features and are then fed to the proposed Selective Feature Fusion Module (SFFM) to extract high-level abstract features. These features are then fed in temporal order to a bidirectional GRU network for temporal modeling, yielding long-term spatio-temporal features. Finally, a Connectionist Temporal Classification (CTC) decoder generates the output text. Experimental results show that the proposed model achieves a 1.0% CER and a 2.3% WER on the GRID corpus, improvements of 52% and 47%, respectively, over LipNet.
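
The last two steps of the pipeline, CTC decoding and scoring by CER and WER, can be illustrated with a small self-contained sketch. The abstract does not say which decoding strategy LCSNet uses, so greedy best-path decoding is an assumption here, as are the blank index, the toy alphabet, and all function names. The error rates use the standard edit-distance definition, which may differ in detail from the paper's evaluation script.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    collapsed = [idx for idx, _ in groupby(best_path)]  # merge adjacent repeats
    return "".join(alphabet[idx] for idx in collapsed if idx != blank)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Toy per-frame scores over a hypothetical alphabet [blank, 'b', 'i', 'n'].
alphabet = ["-", "b", "i", "n"]
frames = [
    [0.10, 0.70, 0.10, 0.10],  # b
    [0.10, 0.60, 0.20, 0.10],  # b (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # i
    [0.10, 0.10, 0.10, 0.70],  # n
    [0.10, 0.10, 0.10, 0.70],  # n (repeat, merged)
]
decoded = ctc_greedy_decode(frames, alphabet)
```

In this toy input the six frames collapse to the three-character string "bin". In a real system the per-frame scores would come from the bidirectional GRU output, and a beam search, possibly with a language model, could replace the greedy step.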

Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

Adaptive Semantic-Spatio-Temporal Graph Convolutional Network for Lip Reading

HMM-based Lip Reading with Stingy Residual 3D Convolution

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory

Multi-Grained Spatio-temporal Modeling for Lip-reading

Learning the Relative Dynamic Features for Word-Level Lipreading

TCS-LipNet: Temporal & Channel & Spatial Attention-Based Lip Reading Network

Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion

Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

LCSNet: End-to-End Lipreading with Channel-aware Feature Selection

Lip-reading with Densely Connected Temporal Convolutional Networks

Connectionist Temporal Fusion For Sign Language Translation

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Audio-Visual Speech Recognition Using A Two-Step Feature Fusion Strategy

Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading