Abstract:Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. I-vector and probabilistic linear discriminant analysis (PLDA) based systems have become the standard in speaker verification applications, but they are less effective with short utterances. In this paper, we first compare two state-of-the-art universal background model (UBM) training methods for i-vector modeling using full-length and short utterance evaluation tasks. The two methods are Gaussian mixture model (GMM) based (denoted I-vector_GMM) and deep neural network (DNN) based (denoted as I-vector_DNN) methods. The results indicate that the I-vector_DNN system outperforms the I-vector_GMM system under various durations (from full length to 5 s). However, the performances of both systems degrade significantly as the duration of the utterances decreases. To address this issue, we propose two novel nonlinear mapping methods which train DNN models to map the i-vectors extracted from short utterances to their corresponding long-utterance i-vectors. The mapped i-vector can restore missing information and reduce the variance of the original short-utterance i-vectors. The proposed methods both model the joint representation of short and long utterance i-vectors: the first method trains an autoencoder first using concatenated short and long utterance i-vectors and then uses the pre-trained weights to initialize a supervised regression model from the short to long version; the second method jointly trains the supervised regression model with an autoencoder reconstructing the short utterance i-vector itself. Experimental results using the NIST SRE 2010 dataset show that both methods provide significant improvement and result in a 24.51% relative improvement in Equal Error Rates (EERs) from a baseline system. In order to learn a better joint representation, we further investigate the effect of a deep encoder with residual blocks, and the results indicate that the residual network can further improve the EERs of a baseline system by up to 26.47%. Moreover, in order to improve the short i-vector mapping to its long version, an additional vector, which represents the average value of phoneme posteriors across frames, is also added to the input, and results in a 28.43% improvement. When further testing the best-validated models of SRE10 on the Speaker In The Wild (SITW) dataset, the methods result in a 23.12% improvement on arbitrary-duration (1–5 s) short-utterance conditions.

Short Utterance Compensation in Speaker Verification via Cosine-Based Teacher-Student Learning of Speaker Embeddings

Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

FA-ExU-Net: the simultaneous training of an embedding extractor and enhancement model for a speaker verification system robust to short noisy utterances

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Introducing Phonetic Information to Speaker Embedding for Speaker Verification

The Sogou System for Short-duration Speaker Verification Challenge 2021

Deep neural network based i-vector mapping for speaker verification using short utterances

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Improving Short-Duration Speaker Recognition by Joint Bark-Wavelet Acoustic Feature Coupling and Triplet Dual-Attention Mechanism Network

Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification

DNN based Speaker Recognition on Short Utterances

Few-shot short utterance speaker verification using meta-learning

Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification

Session Variability Subspace Projection Based Model Compensation for Speaker Verification

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms

ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency

Integrated Replay Spoofing-Aware Text-Independent Speaker Verification

Speaker Verification using Convolutional Neural Networks

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification.