Improved I-Vector Representation for Speaker Diarization

Yan Xu, Ian McLoughlin, Yan Song, Kui Wu
DOI: https://doi.org/10.1007/s00034-015-0206-2
2015-01-01
Abstract: This paper proposes using a previously well-trained deep neural network (DNN) to enhance the i-vector representation used for speaker diarization. In effect, we replace the Gaussian mixture model typically used to train a universal background model (UBM) with a DNN that has been trained on a different large-scale dataset. To train the T-matrix, we use a supervised UBM derived from the DNN: filterbank input features are used to calculate the posterior information, and MFCC features are then used to train the UBM. This replaces the traditional unsupervised UBM derived from a single feature type. Next, we jointly use DNN and MFCC features to calculate the zeroth- and first-order Baum–Welch statistics for training an extractor, from which we obtain the i-vector. The system is shown to achieve a significant improvement on the NIST 2008 speaker recognition evaluation telephone data task compared to state-of-the-art approaches.
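As a rough illustration of the statistics mentioned in the abstract, the sketch below accumulates zeroth- and first-order Baum–Welch statistics from frame-level posteriors and acoustic features, using the standard definitions N_c = sum_t gamma_t(c) and F_c = sum_t gamma_t(c) x_t. It is not the authors' implementation. The function names, the array shapes, and the assumption that the DNN posteriors are already computed and frame-aligned with the MFCCs are all hypothetical choices made for this example.

```python
import numpy as np


def baum_welch_stats(posteriors, features):
    """Accumulate zeroth- and first-order statistics for one utterance.

    posteriors: (T, C) per-frame class posteriors gamma_t(c); assumed here
                to come from a pretrained DNN fed with filterbank features.
    features:   (T, D) per-frame acoustic features x_t; assumed to be MFCCs
                aligned frame-by-frame with the posteriors.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    features = np.asarray(features, dtype=float)
    if posteriors.shape[0] != features.shape[0]:
        raise ValueError("posteriors and features must cover the same frames")

    # Zeroth-order: N_c = sum_t gamma_t(c), shape (C,)
    zeroth = posteriors.sum(axis=0)
    # First-order: F_c = sum_t gamma_t(c) * x_t, shape (C, D)
    first = posteriors.T @ features
    return zeroth, first


def posterior_weighted_means(zeroth, first, eps=1e-10):
    """Per-class feature means F_c / N_c (hypothetical helper).

    One common way to obtain component means for a posterior-driven UBM;
    eps guards against classes that received no posterior mass.
    """
    return first / (zeroth[:, None] + eps)
```

For example, with three frames, two classes, and two-dimensional features, `baum_welch_stats` returns a length-2 vector of soft frame counts and a 2x2 matrix of posterior-weighted feature sums. Per-utterance statistics of this kind are what a T-matrix trainer and i-vector extractor typically consume.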