Abstract: In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. First, a speaker's own characteristics can be portrayed well by a few of his/her facial images, or even a single image, using shallow networks, whereas the fine-grained dynamic features associated with the speech content expressed by the talking face need deep sequential networks to be represented accurately. Therefore, we treat the shallow and deep layers differently for speaker-adaptive lip reading. Second, we observe that a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) have varied effects on lip reading performance for different words and pronunciations, necessitating adaptive enhancement or suppression of the features for robust lip reading. Based on these two observations, we propose to take advantage of the speaker's own characteristics to automatically learn separable hidden unit contributions with different targets for shallow layers and deep layers. For shallow layers, where features related to the speaker's characteristics are stronger than the speech-content-related features, we introduce speaker-adaptive features that learn to enhance the speech content features. For deep layers, where both the speaker's features and the speech content features are expressed well, we introduce speaker-adaptive features that learn to suppress speech-content-irrelevant noise for robust lip reading. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides the evaluation on the popular LRW-ID and GRID datasets, we also release a new dataset for evaluation, CAS-VSR-S68h, to further assess performance in an extreme setting where just a few speakers are available but the speech content covers a large and diversified range.
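
The idea of separate enhancement and suppression targets can be illustrated with a minimal sketch of element-wise hidden unit scaling. This is not the authors' implementation: the gain ranges (above 1 for shallow layers, below 1 for deep layers), the function names, and the hand-picked speaker parameters are all assumptions made for illustration. In a real model the speaker parameters would be predicted from speaker features by a learned network.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def enhance_scale(r):
    # Assumed form: gain in (1, 2), so a shallow-layer unit is only amplified.
    return 1.0 + sigmoid(r)


def suppress_scale(r):
    # Assumed form: gain in (0, 1), so a deep-layer unit is only attenuated.
    return sigmoid(r)


def apply_hidden_unit_scaling(hidden, speaker_params, scale_fn):
    """Multiply each hidden unit by its own speaker-dependent gain."""
    if len(hidden) != len(speaker_params):
        raise ValueError("hidden and speaker_params must have the same length")
    return [h * scale_fn(r) for h, r in zip(hidden, speaker_params)]


# Hypothetical activations and speaker parameters (one parameter per unit).
hidden = [0.5, -1.0, 2.0]
speaker_params = [0.0, 3.0, -3.0]

shallow_out = apply_hidden_unit_scaling(hidden, speaker_params, enhance_scale)
deep_out = apply_hidden_unit_scaling(hidden, speaker_params, suppress_scale)
```

Under these assumptions, every shallow-layer activation grows in magnitude and every deep-layer activation shrinks, with the per-unit amount controlled by the speaker parameter.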

Shuffle is What You Need

OssCSE: Overcoming Surface Structure Bias in Contrastive Learning for Unsupervised Sentence Embedding

Acoustic Feature Shuffling Network for Text-independent Speaker Verification

Shuffle & Divide: Contrastive Learning for Long Text

ShuffleMix: Improving Representations via Channel-Wise Shuffle of Interpolated Hidden States

SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification

A Multitask Learning Framework for Speaker Change Detection with Content Information from Unsupervised Speech Decomposition

Contrastive Speaker Embedding With Sequential Disentanglement

Dynamic Shuffle: An Efficient Channel Mixture Method

Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework

Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation

DeltaVLAD: an Efficient Optimization Algorithm to Discriminate Speaker Embedding for Text-Independent Speaker Verification

Debiased Contrastive Learning of Unsupervised Sentence Representations

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Whitening-based Contrastive Learning of Sentence Embeddings

Unified Video-Language Pre-training with Synchronized Audio

Simple Contrastive Representation Adversarial Learning for NLP Tasks

Improving Dino-Based Self-Supervised Speaker Verification with Progressive Cluster-Aware Training