Improved Word-level Lipreading with Temporal Shrinkage Network and NetVLAD

Heng Yang, Tao Luo, Yakun Zhang, Mingwu Song, Liang Xie, Ye Yan, Erwei Yin
DOI: https://doi.org/10.1145/3536221.3556628
2022-01-01
Abstract: In most recent word-level lipreading architectures, the temporal feature extraction module tends to employ a Multi-scale Temporal Convolution Network (MS-TCN). In our experiments, we noticed that MS-TCN has difficulty dealing with the noisy information that image sequences may contain. To solve this problem, we propose a lipreading architecture based on a temporal shrinkage network and NetVLAD. We first propose a Temporal Shrinkage Unit based on the Residual Shrinkage Network and use it to replace the temporal convolution unit. The improved network, named Multi-scale Temporal Shrinkage Network (MS-TSN), can focus more on relevant information. Following the MS-TSN, which deals with noisy frames, NetVLAD is introduced to integrate local information into a global feature. Compared with Global Average Pooling, NetVLAD can extract key features by clustering. Our experiments on Lipreading in the Wild (LRW) show that the proposed architecture achieves an accuracy of 89.41%, a new state of the art in word-level lipreading. In addition, we build a new Mandarin Chinese lipreading dataset named MCLR-100 and verify our proposed architecture on it.
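The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the shrinkage step is the soft thresholding used in residual shrinkage networks, with a fixed per-channel threshold standing in for the learned one, and that NetVLAD soft-assigns each frame feature to cluster centres and sums the weighted residuals. The names `soft_threshold`, `netvlad`, `alpha`, and the sizes `T`, `D`, `K` are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold(x, tau):
    """Soft thresholding: sign(x) * max(|x| - tau, 0).

    Values with magnitude below tau are zeroed (treated as noise);
    larger values are shrunk toward zero by tau.
    """
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def netvlad(features, centers, alpha=1.0):
    """Aggregate T local features (T, D) into one global descriptor (K * D,).

    Assumption: soft assignment is a softmax over negative scaled squared
    distances to the K centres; a trained layer would learn these weights.
    """
    residuals = features[:, None, :] - centers[None, :, :]   # (T, K, D)
    logits = -alpha * (residuals ** 2).sum(axis=-1)          # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)      # rows sum to 1
    vlad = (assign[:, :, None] * residuals).sum(axis=0)      # (K, D)
    # Intra-normalise each cluster row, then L2-normalise the flat vector.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)


# Toy usage: T frames of D-dimensional features, K clusters (all hypothetical).
rng = np.random.default_rng(0)
T, D, K = 29, 8, 4
x = rng.standard_normal((T, D))

# Fixed per-channel threshold: half the mean absolute activation.
# In a residual shrinkage network this scale would be learned, not fixed.
tau = 0.5 * np.abs(x).mean(axis=0, keepdims=True)            # (1, D)
denoised = soft_threshold(x, tau)

centers = rng.standard_normal((K, D))
descriptor = netvlad(denoised, centers)                      # shape (K * D,)
pooled = denoised.mean(axis=0)                               # GAP baseline, shape (D,)
```

Global Average Pooling collapses all frames into a single `D`-dimensional mean, whereas the NetVLAD-style descriptor keeps one residual vector per cluster, which is the sense in which it can retain key features through clustering.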