Hierarchically Attending Time-Frequency and Channel Features for Improving Speaker Verification

Chenglong Wang, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian
DOI: https://doi.org/10.1109/iscslp49672.2021.9362054
2021-01-01
Abstract: Attention-based models have recently shown powerful representation learning ability in speaker recognition. However, most attention-based models apply attention primarily in the pooling layers. In this work, we present an end-to-end speaker verification system that leverages time-frequency and channel features hierarchically. To further improve system performance, we employ Large Margin Cosine Loss to optimize the model. We carry out experiments on the VoxCeleb1 dataset to evaluate the effectiveness of our methods. The results suggest that our best system outperforms the i-vector + PLDA and x-vector systems by 53.3% and 7.6%, respectively.
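For readers unfamiliar with Large Margin Cosine Loss, the following is a minimal sketch of its usual formulation for a single sample: embeddings and class weight vectors are L2-normalized, a margin is subtracted from the target-class cosine, and the scaled cosines are passed through softmax cross-entropy. The function names and the `scale` and `margin` defaults are illustrative assumptions and are not taken from the paper, whose exact settings are not given in the abstract.

```python
import math


def l2_normalize(vector):
    """Return the vector scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def lmcl_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Large Margin Cosine Loss for one sample (illustrative sketch).

    embedding:     speaker embedding, a list of floats
    class_weights: one weight vector per training speaker
    target:        index of the true speaker
    scale, margin: assumed hyperparameters, not values from the paper
    """
    x = l2_normalize(embedding)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [
        sum(a * b for a, b in zip(x, l2_normalize(w))) for w in class_weights
    ]
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp for the softmax denominator.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[target]
```

With `margin=0` this reduces to ordinary softmax cross-entropy over scaled cosine similarities; a positive margin makes the target class harder to satisfy, which pushes embeddings of the same speaker closer together in angle.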