Speech Guided Disentangled Visual Representation Learning for Lip Reading

Ya Zhao, Cheng Ma, Zunlei Feng, Mingli Song
DOI: https://doi.org/10.1145/3462244.3479952
2021-10-18
Abstract: Lip reading has advanced rapidly in recent years. However, existing methods have two main problems: 1) there is no explicit mechanism to ensure that the extracted visual features relate only to lip movements, which degrades performance when videos contain large variations, such as in speakers’ poses; 2) large quantities of labeled data are required to achieve good results, and such data are difficult to obtain for low-resource languages. In this paper, we propose a new visual representation learning method, SVLR, which extracts disentangled, lip-movement-related visual features for the lip reading task by making use of large quantities of unlabeled audio-visual data. This is achieved by explicitly disentangling the features into a lip-movement-related part and a speaker-identity-related part. Predicting speech from the disentangled features is then used as the training objective to optimize the model parameters. After this cross-modal training, the video encoder that extracts lip movement features is used as a feature extractor for the lip reading task. Experiments on several word-level lip reading benchmarks demonstrate the effectiveness of the proposed method.
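The training setup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear encoders, the time-averaged identity vector, the mean-squared-error loss, and every name and dimension below are assumptions chosen only to show how a speech-prediction objective could drive two separate feature branches using unlabeled audio-visual pairs. No training loop is included.

```python
import random

def linear(x, w):
    """Multiply a feature vector x (length n) by a weight matrix w (n x m)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def init_weights(n_in, n_out, rng):
    """Small random weights; a stand-in for a real network's parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def encode(frames, w_lip, w_id):
    """Split the video representation into a lip part and an identity part."""
    lip = [linear(f, w_lip) for f in frames]  # one lip vector per frame
    per_frame_id = [linear(f, w_id) for f in frames]
    # Assumption: speaker identity is constant within a clip, so the
    # identity branch is averaged over time into a single vector.
    n_frames = len(frames)
    identity = [sum(v[k] for v in per_frame_id) / n_frames
                for k in range(len(per_frame_id[0]))]
    return lip, identity

def speech_prediction_loss(frames, audio, w_lip, w_id, w_dec):
    """Mean squared error between predicted and actual audio features."""
    lip, identity = encode(frames, w_lip, w_id)
    total = 0.0
    for lip_t, audio_t in zip(lip, audio):
        # Concatenate both parts, then predict this frame's audio features.
        pred = linear(lip_t + identity, w_dec)
        total += sum((p - a) ** 2 for p, a in zip(pred, audio_t))
    return total / (len(audio) * len(audio[0]))

# Hypothetical sizes: 5 frames, 8-dim video features, 6-dim audio features.
rng = random.Random(0)
T, D_VIDEO, D_LIP, D_ID, D_AUDIO = 5, 8, 4, 3, 6
frames = [[rng.gauss(0, 1) for _ in range(D_VIDEO)] for _ in range(T)]
audio = [[rng.gauss(0, 1) for _ in range(D_AUDIO)] for _ in range(T)]

w_lip = init_weights(D_VIDEO, D_LIP, rng)
w_id = init_weights(D_VIDEO, D_ID, rng)
w_dec = init_weights(D_LIP + D_ID, D_AUDIO, rng)

loss = speech_prediction_loss(frames, audio, w_lip, w_id, w_dec)

# After cross-modal training, only the lip branch would be reused
# as the feature extractor for the downstream lip reading task.
lip_features, identity_feature = encode(frames, w_lip, w_id)
```

In this sketch the objective needs only paired video and audio, with no transcripts, which mirrors the abstract's point about reducing the dependence on labeled data.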