Abstract: In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. Firstly, a speaker's own characteristics can always be portrayed well by a few of his/her facial images, or even a single image, with shallow networks, while the fine-grained dynamic features associated with the speech content expressed by the talking face always need deep sequential networks to be represented accurately. Therefore, we treat the shallow and deep layers differently for speaker-adaptive lip reading. Secondly, we observe that a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) have varied effects on lip reading performance for different words and pronunciations, necessitating adaptive enhancement or suppression of the features for robust lip reading. Based on these two observations, we propose to take advantage of the speaker's own characteristics to automatically learn separable hidden unit contributions with different targets for shallow layers and deep layers respectively. For shallow layers, where features related to the speaker's characteristics are stronger than the speech-content-related features, we introduce speaker-adaptive features that learn to enhance the speech content features. For deep layers, where both the speaker's features and the speech content features are expressed well, we introduce speaker-adaptive features that learn to suppress the speech-content-irrelevant noise for robust lip reading. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides the evaluation on the popular LRW-ID and GRID datasets, we also release a new dataset for evaluation, CAS-VSR-S68h, to further assess the performance in an extreme setting where just a few speakers are available but the speech content covers a large and diversified range.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the speaker adaptation problem in lip reading. Specifically, it proposes a new method for improving lip-reading models based on the following two observations:

1. **Distinction between Speaker Features and Speech Content Features**:
   - Shallow networks can effectively capture static features of the speaker (such as facial structure and lip shape), while deep sequential networks are needed to capture the fine-grained dynamic features related to speech content.
   - The paper therefore treats the shallow and deep layers differently for speaker adaptation.
2. **Impact of Speaker Features on Different Vocabulary**:
   - A speaker's unique features (such as a prominent oral cavity and mandible) affect different words and pronunciations to different degrees. It is therefore necessary to adaptively enhance or suppress these features to achieve robust lip-reading performance.

Based on these two observations, the paper proposes a speaker-adaptive lip-reading method that uses the speaker's own characteristics to learn Separable Hidden Unit Contributions. The method enhances speech-content-related features in the shallow layers and suppresses noise unrelated to speech content in the deep layers, thereby improving the robustness of the lip-reading model.

Additionally, the paper introduces a new benchmark dataset, CAS-VSR-S68h, for evaluating lip reading in an extreme setting with only a few speakers but diverse speech content. According to the abstract, the method consistently outperforms existing methods on the public LRW-ID and GRID datasets as well as on the newly released dataset.
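To make the enhance/suppress idea concrete, the following is a minimal sketch of per-channel gating driven by a speaker embedding, written in plain Python. It is not the paper's implementation: the gate ranges (a scale in (1, 2) for enhancement and in (0, 1) for suppression), the linear projection, and every name (`enhance_gate`, `suppress_gate`, `apply_gate`, `linear`, the toy weights) are assumptions chosen for illustration. For reference, classic LHUC rescales each hidden unit by `2 * sigmoid(r)`, which lies in (0, 2); here that range is simply split into two halves to mimic the different targets for shallow and deep layers.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, x):
    """Project a vector x with a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def enhance_gate(logits):
    """Assumed gate for shallow layers: per-channel scale in (1, 2)."""
    return [1.0 + sigmoid(v) for v in logits]


def suppress_gate(logits):
    """Assumed gate for deep layers: per-channel scale in (0, 1)."""
    return [sigmoid(v) for v in logits]


def apply_gate(hidden, gate):
    """Rescale each hidden unit by its speaker-dependent contribution."""
    return [h * g for h, g in zip(hidden, gate)]


# Hypothetical speaker embedding (e.g. computed from a few face images).
speaker_embedding = [0.5, -1.0]

# Hypothetical projections from the embedding to three hidden channels.
w_shallow = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_deep = [[-1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

shallow_hidden = [2.0, -1.0, 4.0]
deep_hidden = [0.3, 1.5, -2.0]

# Shallow layer: amplify channels, never shrink them.
shallow_out = apply_gate(
    shallow_hidden, enhance_gate(linear(w_shallow, speaker_embedding))
)

# Deep layer: attenuate channels, never amplify them.
deep_out = apply_gate(
    deep_hidden, suppress_gate(linear(w_deep, speaker_embedding))
)

print(shallow_out)
print(deep_out)
```

In a trained model, the projection weights would be learned jointly with the lip-reading network, so that the gates depend on the speaker while the backbone stays shared across speakers.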

Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language

Speaker-adaptive Lip Reading with User-dependent Padding

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

Learn an Effective Lip Reading Model without Pains

Learning Speaker-specific Lip-to-Speech Generation

Learning the Relative Dynamic Features for Word-Level Lipreading

Speaker-Independent Lipreading by Disentangled Representation Learning

Speech Guided Disentangled Visual Representation Learning for Lip Reading

Sub-word Level Lip Reading With Visual Attention

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Multi-Grained Spatio-temporal Modeling for Lip-reading

Learning from the Master: Distilling Cross-modal Advanced Knowledge for Lip Reading

Mutual Information Maximization for Effective Lip Reading

Part-Based Lipreading for Audio-Visual Speech Recognition

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach