LCSNet: End-to-End Lipreading with Channel-aware Feature Selection

Feng Xue, Tian Yang, Kang Liu, Zikun Hong, Mingwei Cao, Dan Guo, Richang Hong
DOI: https://doi.org/10.1145/3524620
2022-03-17
Abstract: Lipreading is the task of decoding the movement of a speaker’s lip region into text. In recent years, lipreading methods based on deep neural networks have attracted widespread attention, and their accuracy has far surpassed that of experienced human lipreaders. The visual differences between some phonemes are extremely subtle and pose a great challenge to lipreading. Most existing lipreading methods do not further process the extracted visual features, and therefore suffer from two main problems. First, the extracted features contain a lot of useless information, such as noise caused by differences in speech speed and lip shape. Second, the extracted features are not abstract enough to distinguish phonemes with similar pronunciation. These problems degrade lipreading performance. To extract features from the lip region that are more distinguishable and more relevant to the speech content, this paper proposes an end-to-end deep neural network-based lipreading model (LCSNet). The proposed model extracts short-term spatio-temporal features and motion trajectory features from the lip region in video clips. The extracted features are filtered by a channel attention module to eliminate useless features, and are then fed to the proposed Selective Feature Fusion Module (SFFM) to extract high-level abstract features. These features are then fed, in temporal order, to a bidirectional GRU network for temporal modeling, which yields long-term spatio-temporal features. Finally, a Connectionist Temporal Classification (CTC) decoder generates the output text. Experimental results show that the proposed model achieves a 1.0% CER and 2.3% WER on the GRID corpus, improvements of 52% and 47%, respectively, over LipNet.
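The final stage of the pipeline, CTC decoding, can be illustrated with a minimal sketch. The abstract does not say which decoding strategy LCSNet uses, so the example below assumes simple greedy (best-path) decoding: take the highest-scoring label per frame, merge consecutive repeats, and drop the blank symbol. The function name `ctc_greedy_decode`, the toy alphabet, and the frame scores are all hypothetical and chosen only for illustration; in the model, the per-frame scores would come from the bidirectional GRU output.

```python
from itertools import groupby

BLANK = 0  # assumed convention: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Best-path CTC decoding (an assumed strategy, not confirmed by the paper).

    frame_scores: one list of per-label scores for each video frame.
    alphabet: maps a label index to its character; alphabet[blank] is unused.
    """
    # 1. Pick the highest-scoring label index for every frame.
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks and map the remaining labels to characters.
    return "".join(alphabet[label] for label in collapsed if label != blank)


# Toy alphabet and scores (hypothetical): blank, then the letters b, i, n.
alphabet = ["_", "b", "i", "n"]
frame_scores = [
    [0.1, 0.7, 0.1, 0.1],  # b
    [0.1, 0.6, 0.2, 0.1],  # b (repeat, merged with the previous frame)
    [0.8, 0.1, 0.1, 0.0],  # blank
    [0.1, 0.1, 0.7, 0.1],  # i
    [0.0, 0.1, 0.2, 0.7],  # n
    [0.1, 0.0, 0.1, 0.8],  # n (repeat, merged with the previous frame)
]
decoded = ctc_greedy_decode(frame_scores, alphabet)
print(decoded)
```

The blank symbol is what lets CTC emit genuinely doubled letters: two identical labels separated by a blank survive the merge step, whereas two adjacent identical labels collapse into one.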