Abstract: In this paper, we address the problem of speaker verification in conditions unseen or unknown during development. A standard method for speaker verification consists of extracting speaker embeddings with a deep neural network and processing them through a backend composed of probabilistic linear discriminant analysis (PLDA) and global logistic regression score calibration. This method is known to produce systems that perform poorly under conditions that differ from those used to train the calibration model. We propose to modify the standard backend by introducing an adaptive calibrator that uses duration and other automatically extracted side-information to adapt to the conditions of the inputs. The backend is trained discriminatively to optimize binary cross-entropy. When trained on a number of diverse datasets that are labeled only with respect to speaker, the proposed backend consistently and, in some cases, dramatically improves calibration compared to the standard PLDA approach on a number of held-out datasets, some of which are markedly different from the training data. Discrimination performance is also consistently improved. We show that joint training of the PLDA and the adaptive calibrator is essential: the same benefits cannot be achieved when freezing PLDA and fine-tuning the calibrator. To our knowledge, the results in this paper are the first evidence in the literature that it is possible to develop a speaker verification system with robust out-of-the-box performance on a large variety of conditions.
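The difference between a global calibrator and a condition-adaptive one can be illustrated with a minimal sketch. This is not the paper's parameterization: the function names, the linear dependence of the scale and offset on the side-information features, and the `prior` argument are all assumptions made for illustration. The sketch only shows the general idea that a global calibrator applies one affine map to every raw score, whereas an adaptive calibrator lets the scale and offset vary per trial, and that either can be scored with binary cross-entropy.

```python
import numpy as np


def global_calibration(scores, alpha, beta):
    """Affine score-to-LLR map with one scale and offset shared by all trials."""
    return alpha * scores + beta


def adaptive_calibration(scores, side_info, w_alpha, w_beta):
    """Affine map whose scale and offset depend on per-trial side information.

    side_info has shape (n_trials, n_features); a constant column of ones
    plays the role of a bias term. The linear form is an illustrative
    assumption, not the form used in the paper.
    """
    alpha = side_info @ w_alpha  # per-trial scale
    beta = side_info @ w_beta    # per-trial offset
    return alpha * scores + beta


def binary_cross_entropy(llrs, labels, prior=0.5):
    """Mean binary cross-entropy of the posteriors implied by the LLRs."""
    logits = llrs + np.log(prior / (1.0 - prior))
    # Target trials cost log(1 + exp(-logit)); non-target trials cost
    # log(1 + exp(logit)). logaddexp keeps both numerically stable.
    signed = np.where(labels == 1, -logits, logits)
    return np.logaddexp(0.0, signed).mean()
```

With a side-information matrix that contains only the constant column, the adaptive map reduces to the global one, which is a convenient sanity check when experimenting with richer features such as log-duration.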

Angular Softmax Loss for End-to-end Speaker Verification

Ensemble Additive Margin Softmax for Speaker Verification

Large Margin Softmax Loss for Speaker Verification

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Real Additive Margin Softmax for Speaker Verification

Maximum Likelihood I-Vector Space Using PCA for Speaker Verification

End-to-End Feature Learning for Text-Independent Speaker Verification

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Angle-based Softmax Loss for Face Verification

Contrastive Learning for improving End-to-end Speaker Verification

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Exploring Binary Classification Loss For Speaker Verification

Adversarial Speaker Verification

Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification

Towards Robust Speaker Verification with Target Speaker Enhancement

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification

DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification

Maximum Gaussianality training for deep speaker vector normalization

Challenging margin-based speaker embedding extractors by using the variational information bottleneck

X2-Softmax: Margin adaptive loss function for face recognition

A speaker verification backend with robust performance across conditions