Abstract: In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. Firstly, a speaker's own characteristics can always be portrayed well by a few of his/her facial images, or even a single image, using shallow networks, while the fine-grained dynamic features associated with the speech content expressed by the talking face always need deep sequential networks to be represented accurately. Therefore, we treat the shallow and deep layers differently for speaker-adaptive lip reading. Secondly, we observe that a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) have varied effects on lip reading performance for different words and pronunciations, necessitating adaptive enhancement or suppression of the features for robust lip reading. Based on these two observations, we propose to take advantage of the speaker's own characteristics to automatically learn separable hidden unit contributions, with different targets for the shallow layers and the deep layers. For shallow layers, where features related to the speaker's characteristics are stronger than the features related to the speech content, we introduce speaker-adaptive features that learn to enhance the speech content features. For deep layers, where the speaker's features and the speech content features are both expressed well, we introduce speaker-adaptive features that learn to suppress noise irrelevant to the speech content for robust lip reading. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides the evaluation on the popular LRW-ID and GRID datasets, we also release a new dataset for evaluation, CAS-VSR-S68h, to further assess performance in an extreme setting where only a few speakers are available but the speech content covers a large and diversified range.
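The enhance-versus-suppress idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the linear-plus-sigmoid gate, and the two scaling ranges (a factor in (1, 2) for enhancement, a factor in (0, 1) for suppression) are assumptions chosen only to show how a speaker vector might produce per-channel contributions that are applied differently at shallow and deep layers.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(speaker_vec, weights, bias):
    """Map a speaker vector to one gate in (0, 1) per feature channel.

    Hypothetical form: a single linear layer followed by a sigmoid.
    """
    return [
        sigmoid(sum(w * s for w, s in zip(row, speaker_vec)) + b)
        for row, b in zip(weights, bias)
    ]


def enhance(features, gates):
    """Shallow-layer variant (assumed): scale each channel by 1 + gate,
    a factor in (1, 2), so channels are amplified and never attenuated."""
    return [[x * (1.0 + g) for x, g in zip(frame, gates)] for frame in features]


def suppress(features, gates):
    """Deep-layer variant (assumed): scale each channel by gate,
    a factor in (0, 1), so channels are attenuated and never amplified."""
    return [[x * g for x, g in zip(frame, gates)] for frame in features]


# Toy data: a 4-dimensional speaker vector and 2 frames of 3-channel features.
random.seed(0)
speaker_dim, channels = 4, 3
speaker_vec = [random.gauss(0, 1) for _ in range(speaker_dim)]
weights = [[random.gauss(0, 1) for _ in range(speaker_dim)] for _ in range(channels)]
bias = [0.0] * channels
features = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]

gates = channel_gates(speaker_vec, weights, bias)
shallow_out = enhance(features, gates)   # speaker-conditioned enhancement
deep_out = suppress(features, gates)     # speaker-conditioned suppression
```

In a real model the gate parameters would be learned jointly with the lip-reading network, and the shallow and deep layers would each have their own gate parameters; the sketch shares a single set of gates only to keep the example short.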

Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

A text-dependent speaker verification application framework based on Chinese numerical string corpus

Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling

Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification

Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

Unsupervised Speech Representation Pooling Using Vector Quantization

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification

Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

Deep CNNs along the Time Axis with Intermap Pooling for Robustness to Spectral Variations

Attentive Pooling with Learnable Norms for Text Representation

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

Phonetic-Attention Scoring for Deep Speaker Features in Speaker Verification

Double Multi-Head Attention for Speaker Verification

Speaker Characterization by means of Attention Pooling

Exploring Sequential Characteristics in Speaker Bottleneck Feature for Text-Dependent Speaker Verification

Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective

Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability