Abstract: Automatic speaker verification has achieved great success using deep learning approaches with large-scale manually annotated datasets. However, it is very difficult and expensive to collect a large amount of well-labeled data for system building. In this paper, we propose a novel self-supervised learning framework that can construct a high-performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. Then, we introduce a cluster-aware training strategy for DINO to improve the diversity of data. In the iterative learning stage, clustering produces a large number of unreliable labels, so the quality of the pseudo labels is important for system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. More specifically, we model the loss distribution with a Gaussian mixture model (GMM) and obtain the loss-gate threshold dynamically to distinguish reliable from unreliable labels. In addition, we use the model predictions to correct the unreliable labels, so that the unreliable data are better utilized rather than dropped directly. Moreover, we extend DLG-LC to multi-modality to further improve the performance. The experiments are performed on the commonly used VoxCeleb dataset. Compared to the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative EER improvements on the Vox-O, Vox-E and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the newly proposed system achieves results comparable to the fully supervised system, but without using any human-labeled data.
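
The dynamic loss-gate and label correction idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a two-component one-dimensional GMM fitted by plain EM, takes the threshold as the point between the two component means where the weighted densities cross, and replaces every gated-out label with the model's predicted label. The function names (`fit_two_gaussians`, `dynamic_threshold`, `gate_and_correct`) and the synthetic loss values are hypothetical, and the abstract does not say how low-confidence predictions are handled.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(losses, iters=50):
    """Fit a 2-component 1-D GMM with EM; returns (weights, means, variances)."""
    lo, hi = min(losses), max(losses)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    var0 = max(((hi - lo) / 4.0) ** 2, 1e-6)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value.
        resp = []
        for x in losses:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)
    return weights, means, variances


def dynamic_threshold(losses):
    """Loss value where the low-loss and high-loss components are equally likely."""
    weights, means, variances = fit_two_gaussians(losses)
    a, b = (0, 1) if means[0] <= means[1] else (1, 0)  # a = low-loss component

    def gap(x):
        return (weights[a] * normal_pdf(x, means[a], variances[a])
                - weights[b] * normal_pdf(x, means[b], variances[b]))

    lo, hi = means[a], means[b]
    if gap(lo) <= 0 or gap(hi) >= 0:
        return 0.5 * (lo + hi)  # assumption: fall back to the midpoint
    for _ in range(60):  # bisection on the sign change of gap()
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gate_and_correct(losses, pseudo_labels, predicted_labels, threshold):
    """Keep low-loss pseudo labels; replace high-loss ones with model predictions."""
    return [pseudo if loss <= threshold else predicted
            for loss, pseudo, predicted in zip(losses, pseudo_labels, predicted_labels)]


# Synthetic demo: 300 "reliable" low losses and 100 "unreliable" high losses.
random.seed(0)
demo_losses = ([random.gauss(0.5, 0.1) for _ in range(300)]
               + [random.gauss(3.0, 0.4) for _ in range(100)])
threshold = dynamic_threshold(demo_losses)
corrected = gate_and_correct([0.4, 3.1], [7, 7], [7, 2], threshold)
print(threshold, corrected)
```

Because the threshold is recomputed from the current loss values, it moves as training progresses instead of being fixed by hand, which is the sense in which the gate is dynamic.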

SpeakerGAN: Speaker identification with conditional generative adversarial network

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

Targeted Speech Adversarial Example Generation With Generative Adversarial Network

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Extraction of Noise-Robust Speaker Embedding Based on Generative Adversarial Networks

GMM and CNN Hybrid Method for Short Utterance Speaker Recognition

SEC-GAN for robust speaker recognition with emotional state dismatch

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

Vocoder Detection of Spoofing Speech Based on GAN Fingerprints and Domain Generalization

Data Augmentation using Conditional Generative Adversarial Networks for Robust Speech Recognition

Data Augmentation Using Deep Generative Models for Embedding Based Speaker Recognition

GMM-ResNext: Combining Generative and Discriminative Models for Speaker Verification

CGA-MGAN: Metric GAN Based on Convolution-Augmented Gated Attention for Speech Enhancement

An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions.

Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics

Few-Shot Speaker Identification Using Depthwise Separable Convolutional Network with Channel Attention

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Speaker Identification System Based on Hybrid Neural Network

Time-domain Speech Super-resolution with GAN based Modeling for Telephony Speaker Verification

Speaker Identification Based on Classify Feature Sub-space Gaussian Mixture Model and Neural Net Fusion