Abstract: Automatic speaker verification has made great progress using deep learning approaches trained on large-scale, manually annotated datasets. However, collecting a large amount of well-labeled data for system building is difficult and expensive. In this paper, we propose a novel self-supervised learning framework that can construct a high-performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. We then introduce a cluster-aware training strategy for DINO to improve the diversity of the data. In the iterative learning stage, clustering produces a large number of unreliable labels, so the quality of the pseudo labels is critical to system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. More specifically, we model the loss distribution with a Gaussian mixture model (GMM) and obtain the loss-gate threshold dynamically to distinguish reliable labels from unreliable ones. In addition, we use the model predictions to correct the unreliable labels, so that the unreliable data is put to use rather than dropped directly. Moreover, we extend DLG-LC to multiple modalities to further improve performance. The experiments are performed on the commonly used VoxCeleb dataset. Compared to the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative EER improvement on the Vox-O, Vox-E and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the newly proposed system achieves results comparable to the fully supervised system without using any human-labeled data.
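The dynamic loss-gate idea described in the abstract can be illustrated with a minimal sketch. It fits a two-component one-dimensional GMM to per-sample losses with plain EM and takes as threshold the point between the two component means where the posterior of the low-loss component crosses 0.5. The abstract does not specify how the threshold is derived from the GMM, so this crossing rule, the function names `fit_two_gaussians` and `loss_gate_threshold`, and the initialisation scheme are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def _weighted_densities(x, pi, mu, var):
    # Weighted Gaussian density of every sample under each of the two components.
    return pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)


def fit_two_gaussians(losses, n_iter=200):
    """Fit a 2-component 1-D GMM to per-sample losses with plain EM."""
    x = np.asarray(losses, dtype=float)
    # Assumption: initialise the means from the lower and upper halves of the data.
    med = np.median(x)
    mu = np.array([x[x <= med].mean(), x[x > med].mean()])
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = _weighted_densities(x, pi, mu, var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / x.size
    return pi, mu, var


def loss_gate_threshold(losses, grid_size=2000):
    """Return a loss threshold separating the low-loss and high-loss components.

    Assumption: the threshold is the point between the two means where the
    posterior probability of the low-loss component is closest to 0.5.
    """
    pi, mu, var = fit_two_gaussians(losses)
    low, high = np.argmin(mu), np.argmax(mu)
    grid = np.linspace(mu[low], mu[high], grid_size)
    dens = _weighted_densities(grid, pi, mu, var)
    post_low = dens[:, low] / (dens.sum(axis=1) + 1e-300)
    return float(grid[np.argmin(np.abs(post_low - 0.5))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic losses: a tight low-loss cluster and a broader high-loss cluster.
    losses = np.concatenate([rng.normal(0.5, 0.1, 800), rng.normal(3.0, 0.5, 200)])
    thr = loss_gate_threshold(losses)
    reliable = losses < thr  # samples treated as having reliable pseudo labels
    print(f"threshold={thr:.3f}, reliable fraction={reliable.mean():.3f}")
```

Because the threshold is recomputed from the current loss distribution, it moves as training progresses instead of being fixed by hand. Samples above the threshold would then be candidates for label correction using the model's own predictions rather than being discarded.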

SYLLABLE-DEPENDENT DISCRIMINATIVE LEARNING FOR SMALL FOOTPRINT TEXT-DEPENDENT SPEAKER VERIFICATION

A text-dependent speaker verification application framework based on Chinese numerical string corpus

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Spectral Clustering-Aware Learning of Embeddings for Speaker Diarisation

Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

Pseudo-Phoneme Label Loss for Text-Independent Speaker Verification

A framework of text-dependent speaker verification for Chinese numerical string corpus

Speaker Recognition Based on Pre-Trained Model and Deep Clustering

Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification

An Efficient and Interpretable Speech Enhancement Network Via Deep Dictionary Learning

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

Introducing Phonetic Information to Speaker Embedding for Speaker Verification

End-to-End Attention based Text-Dependent Speaker Verification

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification