Deep Speaker Embedding Extraction with Channel-Wise Feature Responses and Additive Supervision Softmax Loss Function

Jianfeng Zhou,Tao Jiang,Zheng Li,Lin Li,Qingyang Hong
DOI: https://doi.org/10.21437/interspeech.2019-1704
2019-01-01
Abstract:In speaker verification, the convolutional neural networks (CNN) have been successfully leveraged to achieve a great performance. Most of the models based on CNN primarily focus on learning the distinctive speaker embedding from the horizontal direction (time-axis). However, the feature relationship between channels is usually neglected. In this paper, we firstly aim toward an alternate direction of recalibrating the channel-wise features by introducing the recently proposed "squeeze-and-excitation" (SE) module for image classification. We effectively incorporate the SE blocks in the deep residual networks (ResNet-SE) and demonstrate a slightly improvement on VoxCeleb corpuses. Additionally, we propose a new loss function, namely additive supervision softmax (AS-Softmax), to make full use of the prior knowledge of the mis-classified samples at training stage by imposing more penalty on the mis-classified samples to regularize the training process. The experimental results on VoxCeleb corpuses demonstrate that the proposed loss could further improve the performance of speaker system, especially on the case that the combination of the ResNet-SE and the AS-Softmax.
What problem does this paper attempt to address?