Abstract: Lipreading is the task of decoding the movement of a speaker’s lip region into text. In recent years, lipreading methods based on deep neural networks have attracted widespread attention, and their accuracy has far surpassed that of experienced human lipreaders. However, the visual differences between some phonemes are extremely subtle, which poses a great challenge to lipreading. Most existing lipreading methods do not further process the extracted visual features and therefore suffer from two main problems. First, the extracted features contain a lot of useless information, such as noise caused by differences in speech speed and lip shape. Second, the extracted features are not abstract enough to distinguish phonemes with similar pronunciation. Both problems degrade lipreading performance. To extract features from the lip region that are more discriminative and more relevant to the speech content, this paper proposes LCSNet, an end-to-end lipreading model based on deep neural networks. The proposed model extracts short-term spatio-temporal features and motion trajectory features from the lip region in video clips. The extracted features are filtered by a channel attention module to suppress useless features and are then fed to the proposed Selective Feature Fusion Module (SFFM) to extract high-level abstract features. These features are then fed in temporal order to a bidirectional GRU network for temporal modeling, which yields long-term spatio-temporal features. Finally, a Connectionist Temporal Classification (CTC) decoder generates the output text. Experimental results show that the proposed model achieves a 1.0% CER and a 2.3% WER on the GRID corpus, improvements of 52% and 47%, respectively, over LipNet.
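To make the last two steps of the abstract concrete, the following is a minimal Python sketch of how per-frame CTC labels might be collapsed into an output sequence and how CER and WER are commonly computed from edit distance. It is not taken from the paper: greedy (best-path) decoding, the blank index 0, and all function names are assumptions made for illustration, and the paper's decoder may use a different strategy such as beam search.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse per-frame argmax labels: merge repeated labels, then drop blanks."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current_row.append(min(
                previous_row[j] + 1,                           # deletion
                current_row[j - 1] + 1,                        # insertion
                previous_row[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1]


def error_rate(reference: Sequence, hypothesis: Sequence) -> float:
    """Edit distance normalized by the reference length."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())
```

For example, with an illustrative GRID-style sentence, `wer("bin blue at f two now", "bin blue in f two now")` counts one substituted word out of six. Normalizing by the reference length is the usual convention, which is why a single wrong word weighs more in WER than the corresponding wrong characters do in CER.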

Deformation Flow Based Two-Stream Network for Lip Reading

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Adaptive Semantic-Spatio-Temporal Graph Convolutional Network for Lip Reading

TalkingFlow: Talking Facial Landmark Generation with Multi-Scale Normalizing Flow Network

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

LCSNet: End-to-End Lipreading with Channel-aware Feature Selection

Two-Stream Network for Sign Language Recognition and Translation

HMM-based Lip Reading with Stingy Residual 3D Convolution

Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading

Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention

Parallel and High-Fidelity Text-to-Lip Generation

Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

TCS-LipNet: Temporal & Channel & Spatial Attention-Based Lip Reading Network

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

DualLip: A System for Joint Lip Reading and Generation

Multi-Grained Spatio-temporal Modeling for Lip-reading

Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder