Abstract: Automatic speaker verification has achieved strong results using deep learning with large-scale manually annotated datasets. However, collecting a large amount of well-labeled data for system building is difficult and expensive. In this paper, we propose a novel self-supervised learning framework that can build a high-performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. We then introduce a cluster-aware training strategy for DINO to improve the diversity of data. In the iterative learning stage, clustering produces a large number of unreliable labels, so the quality of the pseudo labels is critical to system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. More specifically, we model the loss distribution with a Gaussian mixture model (GMM) and obtain the loss-gate threshold dynamically to distinguish reliable from unreliable labels. In addition, we use the model predictions to correct the unreliable labels, so that the unreliable data is better utilized rather than dropped directly. Moreover, we extend DLG-LC to multiple modalities to further improve performance. The experiments are performed on the commonly used VoxCeleb dataset. Compared to the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative equal error rate (EER) improvements on the Vox-O, Vox-E and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the newly proposed system achieves results comparable to the fully supervised system, without using any human-labeled data.
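
The dynamic loss-gate and label correction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the threshold is derived from the GMM or how predictions are used for correction. The sketch assumes a two-component one-dimensional GMM fitted with plain EM, a threshold placed where the two weighted component densities are equal, and a hard argmax correction gated by a hypothetical confidence parameter `min_conf`. All function names are illustrative.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(losses, iters=200):
    """Fit a two-component 1-D GMM to per-sample losses with plain EM."""
    xs = sorted(losses)
    n = len(xs)
    half = n // 2
    # Assumption: initialise the means from the lower and upper halves.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    mean_all = sum(xs) / n
    var_all = max(sum((x - mean_all) ** 2 for x in xs) / n, 1e-6)
    var = [var_all, var_all]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value.
        resp = []
        for x in xs:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            sq = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs))
            var[k] = max(sq / nk, 1e-6)
            w[k] = nk / n
    return w, mu, var


def loss_gate_threshold(w, mu, var):
    """Loss value where the two weighted component densities cross."""
    lo, hi = (0, 1) if mu[0] <= mu[1] else (1, 0)

    def gap(x):
        return (w[lo] * normal_pdf(x, mu[lo], var[lo])
                - w[hi] * normal_pdf(x, mu[hi], var[hi]))

    a, b = mu[lo], mu[hi]
    if gap(a) <= 0 or gap(b) >= 0:
        return (a + b) / 2.0  # fallback when the crossing is not bracketed
    for _ in range(60):  # bisection between the two component means
        m = (a + b) / 2.0
        if gap(m) > 0:
            a = m
        else:
            b = m
    return (a + b) / 2.0


def gate_and_correct(losses, labels, probs, threshold, min_conf=0.5):
    """Keep reliable labels; replace unreliable ones with confident predictions.

    Returns None for samples that are neither reliable nor confidently predicted.
    """
    out = []
    for loss, label, p in zip(losses, labels, probs):
        if loss <= threshold:
            out.append(label)  # low loss: pseudo label treated as reliable
        else:
            best = max(range(len(p)), key=p.__getitem__)
            out.append(best if p[best] >= min_conf else None)
    return out
```

Because the GMM is refitted on the current losses, the threshold moves as training progresses, which is what makes the gate dynamic rather than a fixed hyperparameter. In practice the correction step could also use the soft prediction as a training target instead of a hard label; the hard argmax here is only a simplification.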
