Speaker-Independent Lipreading with Limited Data.

Chenzhao Yang, Shilin Wang, Xingxuan Zhang, Yun Zhu
DOI: https://doi.org/10.1109/icip40778.2020.9190780
2020-01-01
Abstract: Recent research has demonstrated that, given a huge annotated training dataset, some sophisticated automatic lipreading methods perform even better than a professional human lip reader. However, when the training set is limited, i.e., contains only a small number of speakers, most existing lipreading approaches cannot provide accurate recognition results for unseen speakers because of inter-speaker variability. To improve lipreading performance in the speaker-independent scenario, this paper proposes a new deep neural network (DNN). The proposed network is composed of two parts: the Transformer-based Visual Speech Recognition Network (TVSR-Net) and the Speaker Confusion Block (SC-Block). The TVSR-Net is designed to extract lip features and recognize the speech. The SC-Block aims to achieve speaker normalization by eliminating the influence of different talking styles and habits. A Multi-Task Learning (MTL) scheme is designed for network optimization. Experimental results on the GRID dataset demonstrate the effectiveness of the proposed network for speaker-independent recognition with limited training data.
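The abstract does not state the loss functions used, so the following is only a minimal sketch of how a multi-task objective of this general kind could be assembled, not the authors' formulation. It assumes a recognition head trained with cross-entropy and a speaker head whose predictions are pushed toward a uniform distribution, so that the shared features carry little speaker identity. The function names, the uniform-target confusion term, and the `weight` parameter are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target_index):
    """Standard cross-entropy for one sample with an integer class label."""
    return -log_softmax(logits)[target_index]


def speaker_confusion_loss(speaker_logits):
    """Cross-entropy between a uniform target and the predicted speaker
    distribution (an assumed confusion term, not taken from the paper).
    It reaches its minimum, log(K), when all K speakers are equally likely."""
    log_probs = log_softmax(speaker_logits)
    return -sum(log_probs) / len(log_probs)


def multitask_loss(speech_logits, speech_target, speaker_logits, weight=0.5):
    """Weighted sum of the recognition loss and the confusion loss.
    `weight` is a hypothetical trade-off hyperparameter."""
    recognition = cross_entropy(speech_logits, speech_target)
    confusion = speaker_confusion_loss(speaker_logits)
    return recognition + weight * confusion


if __name__ == "__main__":
    speech_logits = [2.0, 0.5, -1.0]   # scores over three word classes
    speaker_logits = [3.0, 0.0, 0.0]   # scores over three training speakers
    print(multitask_loss(speech_logits, 0, speaker_logits))
```

In this sketch the confusion term is smallest when the speaker head cannot tell the training speakers apart, which is one common way to encourage speaker-invariant features; gradient reversal against a speaker classifier would be another option.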