Abstract: Self-supervised learning (SSL) has attracted increasing attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives primarily for speaker representation. Understanding how these models represent information is essential for improving their efficiency and effectiveness. Although speech SSL has been analyzed extensively, there has been limited investigation into what information speaker SSL captures and how its representations differ from those of speech SSL or of fully supervised speaker models. This paper addresses these fundamental questions. We explore the capacity to capture various speech properties by applying SUPERB evaluation probing tasks to speech and speaker SSL models. We also examine which layers are predominantly utilized for each task to identify differences in how speech is represented. Furthermore, we conduct direct comparisons to measure the similarities between layers within and across models. Our analysis reveals that 1) the capacity to represent content information is somewhat unrelated to enhanced speaker representation, 2) specific layers of speech SSL models appear to be partly specialized in capturing linguistic information, and 3) speaker SSL models tend to disregard linguistic information but exhibit more sophisticated speaker representation.

Why Does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models

Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

Investigating Self-Supervised Learning for Speech Enhancement and Separation

More Speaking or More Speakers?

Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations

One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification

Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification

Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings

Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information

What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning

SCDNet: Self-supervised Learning Feature-based Speaker Change Detection

Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies