Combination of Multiple Embeddings for Speaker Retrieval

Xinmei Su, Qingran Zhan, Chenguang Hu, Xiang Xie
DOI: https://doi.org/10.21437/odyssey.2022-53
2022-01-01
Abstract: Speaker retrieval (SR) is the task of finding enrolled speakers among a large number of test utterances. Speaker embeddings for retrieval are generally extracted with deep neural networks. The Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), the state-of-the-art (SOTA) neural network for speaker verification (SV), can also be applied to SR. In this paper, we propose an extension of the ECAPA-TDNN architecture that combines multiple embeddings from different layers. First, we replace the front TDNN layers in ECAPA-TDNN with multi-scale convolution layers built from 1-D convolutional kernels of multiple scales. By applying multi-scale convolution, feature maps at multiple scales are extracted, so the network learns information at several scales. Second, skip connections are added in the SE-Res2Blocks to avoid overfitting. Third, a novel pooling method is employed and concatenated with attentive statistics pooling to achieve better performance. Combining multiple pooling methods helps the network capture more spatial features. The proposed system obtains a relative improvement of 22.7% over the previous SOTA model. A further qualitative analysis shows that the proposed system clusters utterances from the same speaker better.
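The two ideas at the core of the abstract, parallel 1-D convolutions at several kernel widths and concatenation of several pooled summaries into one embedding, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the kernel widths and the single-channel input are assumptions made for illustration. The abstract does not describe the novel pooling method, so max pooling stands in for it, and the statistics pooling shown here is the plain, non-attentive version (unweighted mean and standard deviation over time).

```python
from math import sqrt


def conv1d_same(signal, kernel):
    """1-D cross-correlation, zero-padded so the output keeps the input length.

    Assumes an odd kernel length (hypothetical helper, not from the paper).
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(signal))
    ]


def multi_scale_conv(signal, kernels):
    """Apply kernels of different widths in parallel: one feature map per kernel."""
    return [conv1d_same(signal, kernel) for kernel in kernels]


def stats_pool(feature_map):
    """Plain statistics pooling: mean and standard deviation over time."""
    n = len(feature_map)
    mean = sum(feature_map) / n
    var = sum((x - mean) ** 2 for x in feature_map) / n
    return [mean, sqrt(var)]


def max_pool(feature_map):
    """Stand-in for the paper's unspecified novel pooling (an assumption)."""
    return [max(feature_map)]


def embed(signal, kernels):
    """Concatenate both pooled summaries of every scale into one vector."""
    embedding = []
    for feature_map in multi_scale_conv(signal, kernels):
        embedding.extend(stats_pool(feature_map) + max_pool(feature_map))
    return embedding


# Hypothetical averaging kernels of widths 1, 3 and 5.
kernels = [[1.0], [1 / 3] * 3, [1 / 5] * 5]
embedding = embed([0.0, 1.0, 4.0, 1.0, 0.0], kernels)
```

With three kernel widths and three pooled values per feature map, `embedding` has nine entries. In a real network the kernels would be learned and each scale would produce many channels, but the concatenation pattern is the same.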