End-to-End Feature Learning for Text-Independent Speaker Verification

Fangzhou Chen, Tengyue Bian, Li Xu
DOI: https://doi.org/10.1109/ccdc.2019.8833087
2019-01-01
Abstract: Deep neural networks (DNNs) are widely used in text-independent speaker verification, especially convolutional models trained with triplet loss. However, both training efficiency and the quality of the learned features leave room for improvement. In this paper, we present an end-to-end framework for training speaker verification models efficiently. In detail, we introduce redesigned residual blocks into the network architecture and propose a method for selecting hard triplets that improves on the original triplet loss function. Furthermore, we investigate the effects of hyperparameters and the framing strategy in the input pipeline for fine-tuning. Experimental results on the LibriSpeech and AISHELL-2 datasets demonstrate that the proposed method reduces the verification equal error rate by more than 20% in relative terms, confirming its advantage over methods from previous work.
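
The abstract does not spell out how hard triplets are selected, so the following is only a generic illustration of the idea, not the authors' method. It sketches the common "batch-hard" variant of triplet loss on cosine similarity: for each anchor, take the least similar utterance of the same speaker and the most similar utterance of a different speaker. The function names, the margin value of 0.2, and the use of cosine similarity are all assumptions made for this example, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def batch_hard_triplet_loss(embeddings, speaker_ids, margin=0.2):
    """Mean hinge loss over anchors using the hardest positive and negative.

    Hypothetical helper for illustration; `margin` is an assumed value.
    Anchors with no positive or no negative in the batch are skipped.
    """
    losses = []
    for i, (anchor, spk) in enumerate(zip(embeddings, speaker_ids)):
        # Similarities to other utterances of the same speaker.
        pos = [
            cosine_similarity(anchor, e)
            for j, (e, s) in enumerate(zip(embeddings, speaker_ids))
            if s == spk and j != i
        ]
        # Similarities to utterances of different speakers.
        neg = [
            cosine_similarity(anchor, e)
            for e, s in zip(embeddings, speaker_ids)
            if s != spk
        ]
        if not pos or not neg:
            continue
        hardest_pos = min(pos)  # same speaker, least similar
        hardest_neg = max(neg)  # different speaker, most similar
        losses.append(max(0.0, hardest_neg - hardest_pos + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

With well-separated speakers the loss is zero, since every hardest negative already sits more than the margin below the hardest positive; it grows as embeddings of different speakers overlap.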