TCS-LipNet: Temporal & Channel & Spatial Attention-Based Lip Reading Network

Huanjie Chen, Wenjuan Li, Zhigang Cheng, Xiubo Liang, Qifei Zhang
DOI: https://doi.org/10.1007/978-3-031-44201-8_34
2023-01-01
Abstract: Lip reading translates a sequence of lip-movement images into a text sequence. The task requires both temporal and spatial information, which makes feature extraction difficult. This paper proposes a new lip-reading model, TCS-LipNet, built around a temporal-channel-spatial attention module (TCSAM). Compared with a channel-spatial attention mechanism, TCSAM additionally associates channel and spatial features along the temporal dimension, which improves model performance. TCS-LipNet uses a TCSAM-based ResNet18 as the front-end module to strengthen visual feature extraction, and DC-TCN (Densely Connected Temporal Convolutional Networks) as the back-end module to model the temporal correlation of the sequence. Experiments show that TCS-LipNet achieves 92.2% accuracy on LRW, which the authors report as the highest accuracy at the time of publication.
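The abstract does not specify how TCSAM is built, so the following is only a minimal sketch of what gating a feature map along the temporal, channel, and spatial dimensions could look like. It assumes a CBAM-style sequence of sigmoid gates extended with a per-frame gate; the function name `tcs_attention`, the weight shapes, and the ordering of the three gates are all illustrative assumptions, not the authors' design.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def tcs_attention(x, w_t, w_c, w_s):
    """Hypothetical temporal -> channel -> spatial gating (NOT the paper's TCSAM).

    x   : features of shape (T, C, H, W)
    w_t : (T, T) weights mixing per-frame descriptors   (assumed)
    w_c : (C, C) weights mixing per-channel descriptors (assumed)
    w_s : (2,)   weights combining channel-mean and channel-max maps (assumed)
    """
    # Temporal gate: one scalar per frame, from globally averaged features.
    t_desc = x.mean(axis=(1, 2, 3))                 # (T,)
    t_gate = sigmoid(w_t @ t_desc)                  # (T,)
    x = x * t_gate[:, None, None, None]

    # Channel gate: one scalar per frame and channel, from spatial averages.
    c_desc = x.mean(axis=(2, 3))                    # (T, C)
    c_gate = sigmoid(c_desc @ w_c)                  # (T, C)
    x = x * c_gate[:, :, None, None]

    # Spatial gate: one scalar per frame and location, from channel mean/max.
    s_desc = np.stack([x.mean(axis=1), x.max(axis=1)], axis=-1)  # (T, H, W, 2)
    s_gate = sigmoid(s_desc @ w_s)                  # (T, H, W)
    return x * s_gate[:, None, :, :]


# Arbitrary toy sizes: 5 frames, 8 channels, 4x4 spatial grid.
rng = np.random.default_rng(0)
features = rng.standard_normal((5, 8, 4, 4))
gated = tcs_attention(
    features,
    w_t=rng.standard_normal((5, 5)),
    w_c=rng.standard_normal((8, 8)),
    w_s=rng.standard_normal(2),
)
```

The output keeps the input shape, so under these assumptions such a block could be placed between the stages of a residual network without changing the surrounding layers. Each gate lies in (0, 1), so the block only rescales features and never changes their sign.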