Abstract: Automatic speaker verification has made great progress using deep learning approaches trained on large-scale, manually annotated datasets. However, collecting a large amount of well-labeled data for system building is difficult and expensive. In this paper, we propose a novel self-supervised learning framework that can build a high-performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. We then introduce a cluster-aware training strategy for DINO to improve the diversity of data. In the iterative learning stage, clustering produces a large number of unreliable labels, so the quality of the pseudo labels is critical for system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. More specifically, we model the loss distribution with a Gaussian mixture model (GMM) and obtain the loss-gate threshold dynamically to distinguish reliable from unreliable labels. In addition, we use the model predictions to correct the unreliable labels, making better use of the unreliable data rather than dropping it directly. Moreover, we extend DLG-LC to multiple modalities to further improve performance. Experiments are performed on the commonly used VoxCeleb dataset. Compared to the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative EER improvement on the Vox-O, Vox-E and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the newly proposed system achieves results comparable to the fully supervised system, without using any human-labeled data.
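The dynamic loss-gate idea in the abstract (fit a GMM to the per-sample losses, then separate reliable from unreliable pseudo labels) can be illustrated with a minimal sketch. The abstract does not give the exact gating rule, so the sketch assumes a two-component 1-D GMM fitted with plain EM and treats a sample as reliable when its posterior under the lower-mean component exceeds 0.5. The function names `fit_two_component_gmm` and `reliable_mask`, the initialization, and the synthetic losses are all hypothetical and are not taken from the paper.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(losses, n_iter=100):
    """Fit a two-component 1-D GMM with EM; returns (weights, means, variances)."""
    n = len(losses)
    lo, hi = min(losses), max(losses)
    overall_mean = sum(losses) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in losses) / n, 1e-6)
    # Assumed initialization: means at 25% and 75% of the observed loss range.
    weights = [0.5, 0.5]
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    variances = [overall_var, overall_var]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each loss.
        resp = []
        for x in losses:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances


def reliable_mask(losses, weights, means, variances):
    """Mark a sample reliable if it most likely belongs to the low-loss component."""
    clean = 0 if means[0] <= means[1] else 1
    mask = []
    for x in losses:
        p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
        total = p[0] + p[1]
        posterior = p[clean] / total if total > 0 else 0.0
        # Losses at or below the low-loss mean are always kept (assumed rule).
        mask.append(x <= means[clean] or posterior > 0.5)
    return mask


# Synthetic per-sample losses: a low-loss group and a high-loss group.
random.seed(0)
demo_losses = [random.gauss(0.5, 0.1) for _ in range(300)]
demo_losses += [random.gauss(3.0, 0.5) for _ in range(100)]
demo_w, demo_m, demo_v = fit_two_component_gmm(demo_losses)
demo_mask = reliable_mask(demo_losses, demo_w, demo_m, demo_v)
print("component means:", demo_m)
print("samples kept as reliable:", sum(demo_mask), "of", len(demo_mask))
```

Because the GMM is refitted on the current losses each time it is called, the gate moves with the loss distribution instead of relying on a fixed, hand-tuned threshold. In a label-correction setup, the samples where the mask is `False` would be candidates for relabeling from model predictions rather than being discarded.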

Cross-Lingual Speaker Identification Using Distant Supervision

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Cross-lingual Speaker Verification with Deep Feature Learning

Improving Bilingual Lexicon Induction on Distant Language Pairs

Cross-lingual Text-independent Speaker Verification using Unsupervised Adversarial Discriminative Domain Adaptation

Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision

Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

A Neural-Network-Based Approach to Identifying Speakers in Novels

Joint Representation Learning of Cross-lingual Words and Entities via Attentive Distant Supervision

Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information

USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models

A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT

Boosting Cross-Domain Speech Recognition with Self-Supervision

Streaming Language Identification using Combination of Acoustic Representations and ASR Hypotheses

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Semi-supervised multi-channel speaker diarization with cross-channel attention

Investigating cross-lingual training for offensive language detection

Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification