Deep Embedding Learning for Text-Dependent Speaker Verification.

Peng Zhang, Peng Hu, Xueliang Zhang
DOI: https://doi.org/10.21437/interspeech.2020-1354
2020-01-01
Abstract: In this paper we present an effective deep embedding learning architecture for the speaker verification task. Compared with the widely used residual neural network (ResNet) and time-delay neural network (TDNN) based architectures, two main improvements are proposed: 1) We use a densely connected convolutional network (DenseNet) to encode the short-term context information of the speaker. 2) A bidirectional attentive pooling strategy is proposed to further model the long-term temporal context and aggregate the important frames that reflect the speaker identity. We evaluate the proposed architecture on the text-dependent speaker verification task in the Interspeech 2020 Far-Field Speaker Verification Challenge (FFSVC2020). Results show that the proposed algorithm outperforms the official FFSVC2020 baseline, with relative reductions of 8.06% and 19.70% in minDCF and of 9.26% and 16.16% in EER on the evaluation sets of Task 1 and Task 3, respectively.
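The abstract does not spell out the bidirectional attentive pooling layer, so the following is only a minimal sketch of plain attention-weighted pooling, the general idea of weighting frames by importance before aggregating them into one utterance-level embedding. The function name `attentive_pool`, the single scoring vector, and the random inputs are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def attentive_pool(frames, score_vector):
    """Pool (T, D) frame features into one (D,) vector.

    Hypothetical helper: each frame gets a scalar score, the scores are
    turned into softmax weights over time, and the output is the
    weighted average of the frames.
    """
    scores = frames @ score_vector        # one score per frame, shape (T,)
    scores = scores - scores.max()        # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()     # softmax over the time axis
    return weights @ frames, weights      # (D,) embedding, (T,) weights


# Illustrative input: 50 frames of 8-dimensional features (random data).
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))
score_vector = rng.standard_normal(8)     # stands in for a learned parameter
embedding, weights = attentive_pool(frames, score_vector)
```

With an all-zero scoring vector every frame receives the same weight and the result reduces to ordinary average pooling, which is the baseline that attention-based pooling is meant to improve on.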