Abstract: In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. First, a speaker's own characteristics can be portrayed well by a few of his/her facial images, or even a single image, with shallow networks, whereas the fine-grained dynamic features associated with the speech content expressed by the talking face need deep sequential networks to be represented accurately. We therefore treat the shallow and deep layers differently for speaker-adaptive lip reading. Second, we observe that a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) have varied effects on lip reading performance for different words and pronunciations, which calls for adaptive enhancement or suppression of the features for robust lip reading. Based on these two observations, we propose to use the speaker's own characteristics to automatically learn separable hidden unit contributions, with different targets for the shallow layers and the deep layers. For shallow layers, where features related to the speaker's characteristics are stronger than those related to the speech content, we introduce speaker-adaptive features that learn to enhance the speech content features. For deep layers, where both the speaker's features and the speech content features are expressed well, we introduce speaker-adaptive features that learn to suppress noise irrelevant to the speech content for robust lip reading. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides the evaluation on the popular LRW-ID and GRID datasets, we also release a new dataset, CAS-VSR-S68h, to further assess performance in an extreme setting where only a few speakers are available but the speech content covers a large and diverse range.
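
The enhance-versus-suppress idea above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the gating functions, so the sketch assumes a sigmoid-based gate in the spirit of classic learned hidden unit contributions, with a hypothetical range of (1, 2) for the shallow-layer "enhance" gate and (0, 1) for the deep-layer "suppress" gate. The names `enhance_gate`, `suppress_gate`, `apply_gate`, and the toy `speaker_logits` are all illustrative assumptions.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def enhance_gate(speaker_logits: List[float]) -> List[float]:
    """Assumed shallow-layer gate in (1, 2): it can only amplify a hidden unit."""
    return [1.0 + sigmoid(z) for z in speaker_logits]


def suppress_gate(speaker_logits: List[float]) -> List[float]:
    """Assumed deep-layer gate in (0, 1): it can only attenuate a hidden unit."""
    return [sigmoid(z) for z in speaker_logits]


def apply_gate(features: List[List[float]], gate: List[float]) -> List[List[float]]:
    """Scale each hidden unit (channel) of every frame by its per-unit gate value."""
    return [[g * v for g, v in zip(gate, frame)] for frame in features]


# Hypothetical toy data: 2 frames x 3 hidden units.
features = [[1.0, -2.0, 0.5], [0.0, 4.0, -1.0]]
# Stand-in for per-unit values predicted from the speaker's face image(s).
speaker_logits = [0.0, 10.0, -10.0]

enhanced = apply_gate(features, enhance_gate(speaker_logits))
suppressed = apply_gate(features, suppress_gate(speaker_logits))
```

In this sketch the same per-speaker vector drives both gates only for brevity; in a trained model each layer would presumably have its own learned, speaker-conditioned gate.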

Speech Guided Disentangled Visual Representation Learning for Lip Reading

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Learn an Effective Lip Reading Model without Pains

Sub-word Level Lip Reading With Visual Attention

Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Tackling Event-Based Lip-Reading by Exploring Multigrained Spatiotemporal Clues

Importance-Aware Information Bottleneck Learning Paradigm for Lip Reading

Multi-Grained Spatio-temporal Modeling for Lip-reading

Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion

Disentangled Speech Representation Learning for One-Shot Cross-Lingual Voice Conversion Using SS-Vae

A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

Disentangling Homophemes in Lip Reading using Perplexity Analysis