Text-independent Speaker Recognition Based on X-vector
Lianyu Zhou, Mingjiang Wang, Yukun Qian, Huaiwen Luo, Heng Li, Xu Lin
DOI: https://doi.org/10.1109/icsip55141.2022.9887021
2022-01-01
Abstract: Speaker recognition, also called voiceprint recognition, currently relies on deep neural networks (DNNs) to extract speaker-discriminative features from speech. The embedding extracted by the DNN is generally called an x-vector. Recently, ResNet-based structures have received extensive attention and have gradually become the basis for speaker recognition research. On the input side, the most commonly used features include Linear Prediction Coefficients (LPC), Mel Frequency Cepstral Coefficients (MFCC), Mel filter bank features (F-bank), and spectrograms. However, no single feature captures all the characteristics of speech. In this paper, we propose a text-independent speaker recognition algorithm based on fused features and the x-vector architecture. We use LPC, F-bank, and spectrogram as acoustic features and fuse them at the frame level; we train a ResNet with a modified structure; and we use the additive angular margin loss as the classification loss. Experiments show that the proposed fused features and modified ResNet achieve an Equal Error Rate of 0.9 on the VCTK dataset, which greatly improves the accuracy of speaker recognition.
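For readers unfamiliar with the additive angular margin loss named in the abstract, the following is a minimal sketch of its standard (ArcFace-style) form for a single embedding. It is not the authors' implementation: the function name `aam_softmax_loss` and the `scale` and `margin` defaults are illustrative assumptions, and the abstract does not state which hyperparameters were used.

```python
import math


def aam_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.2):
    """Additive angular margin softmax loss for one embedding (a sketch).

    embedding:     list of floats (e.g. an x-vector).
    class_weights: one weight vector per speaker class.
    label:         index of the true speaker.
    scale, margin: illustrative values, NOT taken from the paper.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(embedding)
    # Cosine of the angle between the embedding and each class weight vector.
    cosines = [sum(a * b for a, b in zip(e, unit(w))) for w in class_weights]

    logits = []
    for j, c in enumerate(cosines):
        if j == label:
            # Add the margin to the angle of the true class only.
            # (No special handling for theta + margin > pi in this sketch.)
            theta = math.acos(max(-1.0, min(1.0, c)))
            c = math.cos(theta + margin)
        logits.append(scale * c)

    # Cross-entropy via a numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical speaker classes
    x = [1.0, 1.0]                      # embedding equidistant from both
    print(aam_softmax_loss(x, weights, label=0, margin=0.0))
    print(aam_softmax_loss(x, weights, label=0, margin=0.2))
```

With `margin=0.0` the function reduces to a scaled cosine softmax. A positive margin makes the true-class logit smaller, so the network must place embeddings closer in angle to their class weight vector to reach the same loss.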