Self-Attention Networks for Text-Independent Speaker Verification

Tengyue Bian, Fangzhou Chen, Li Xu
DOI: https://doi.org/10.1109/ccdc.2019.8833466
2019-01-01
Abstract: In this paper, we present a self-attention-based model for the text-independent speaker verification task and a novel variant of the triplet loss. Conventional convolutional neural networks (CNNs) used in speaker verification need very deep architectures to achieve strong performance. In our proposed model, the self-attention mechanism can easily capture long-range dependencies, thus achieving better representational capability with fewer parameters. Building on the triplet loss, we propose a novel triplet selection method, which makes training more efficient and yields a significant performance improvement. Text-independent speaker verification experiments on the AISHELL-2 corpus show that the proposed model with the improved loss function reduces the verification equal error rate (EER) by a relative 16.81% compared with the state-of-the-art ResNet-like model trained with the common triplet loss, while having fewer parameters and requiring lower computational cost.
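For readers unfamiliar with the common triplet loss that the abstract uses as its baseline, the following is a minimal sketch of the standard hinge formulation, max(0, d(a, p) - d(a, n) + margin), with cosine distance. It is not the paper's novel triplet selection method, which the abstract does not describe. The function names, the margin value, and the toy embeddings are all assumptions made for illustration. The last helper shows how a relative EER reduction of the kind reported above is computed, using hypothetical EER values rather than figures from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss in hinge form; margin=0.2 is an assumed value."""
    d_ap = cosine_distance(anchor, positive)  # same-speaker distance
    d_an = cosine_distance(anchor, negative)  # different-speaker distance
    return max(0.0, d_ap - d_an + margin)


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent, given two absolute EER values."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Toy 2-D speaker embeddings (hypothetical, not from the paper).
anchor = [1.0, 0.0]
same_speaker = [1.0, 0.0]
other_speaker = [0.0, 1.0]

# A well-separated triplet already satisfies the margin and contributes no loss.
easy = triplet_loss(anchor, same_speaker, other_speaker)

# Swapping the roles produces a violated triplet with a positive loss.
hard = triplet_loss(anchor, other_speaker, same_speaker)

# Hypothetical EERs: going from 5.0% to 4.0% is a 20% relative reduction.
example_reduction = relative_reduction(5.0, 4.0)
```

Only triplets that violate the margin produce a non-zero loss, which is why the choice of which triplets to train on affects training efficiency.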