Abstract: Conventional automatic speaker verification systems can usually be decomposed into a front-end model, such as a time delay neural network (TDNN), for extracting speaker embeddings and a back-end model, such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA), for similarity scoring. However, optimizing the front-end and back-end models sequentially may lead to a local minimum, which in theory prevents the system as a whole from reaching its optimum. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and the NPLDA E2E model, most of them have not fully investigated how to model the intra-relationship between multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification designed specifically for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among multiple enrollment utterances, our model is equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is used to balance the importance of positive and negative samples within a mini-batch and to focus on the difficult samples during training. We also use several data augmentation techniques for better optimization, including conventional noise augmentation with the MUSAN and RIRs datasets and a unique speaker embedding-level mixup strategy.
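
The focal loss and the embedding-level mixup mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), applied to a single verification trial, and a plain linear interpolation between two speaker embeddings. The function names, the default `gamma` and `alpha` values, and the list-based embeddings are illustrative assumptions; the abstract does not specify the exact formulation used in the paper.

```python
import math


def binary_focal_loss(score, label, gamma=2.0, alpha=0.5):
    """Standard binary focal loss for one trial (assumed form, not the paper's).

    score: raw similarity logit produced by the back-end.
    label: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    gamma: focusing parameter; larger values down-weight easy trials more.
    alpha: weight on target trials; non-target trials get 1 - alpha.
    """
    p = 1.0 / (1.0 + math.exp(-score))        # sigmoid of the logit
    p_t = p if label == 1 else 1.0 - p        # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)           # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mixup_embeddings(emb_a, emb_b, lam):
    """Linearly interpolate two embeddings with mixing weight lam in [0, 1]."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]


# A confidently correct target trial contributes far less loss than a
# confidently wrong one, which is how the loss focuses on difficult samples.
easy = binary_focal_loss(4.0, 1)
hard = binary_focal_loss(-4.0, 1)
mixed = mixup_embeddings([1.0, 0.0], [0.0, 1.0], lam=0.25)
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted binary cross-entropy, which makes the effect of the focusing term easy to compare in isolation.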

Ensemble Additive Margin Softmax for Speaker Verification

Real Additive Margin Softmax for Speaker Verification

Large Margin Softmax Loss for Speaker Verification

Angular Softmax Loss for End-to-end Speaker Verification

Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification

Improved Large-Margin Softmax Loss for Speaker Diarisation

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Improved Vocal Effort Transfer Vector Estimation for Vocal Effort-Robust Speaker Verification

Contrastive Learning for improving End-to-end Speaker Verification

Margin-Mixup: A Method for Robust Speaker Verification in Multi-Speaker Audio

Exploring Binary Classification Loss For Speaker Verification

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Getting More for Less: Using Weak Labels and AV-Mixup for Robust Audio-Visual Speaker Verification

DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Dual-model self-regularization and fusion for domain adaptation of robust speaker verification

A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification