Deep Speaker Embedding with Multi-Part Information Aggregation in Frequency-Time Domain for ASV
Xiao Li, Xiao Chen, Dongfei Wang, Zhijun Guo, Kun Niu
DOI: https://doi.org/10.1109/compsac54236.2022.00011
2022-01-01
Abstract: Automatic speaker verification (ASV) verifies the identity of a speaker from a given speech utterance without direct supervision from outside entities. The majority of recent ASV systems with deep speaker embeddings apply temporal pooling or similar techniques to aggregate frame-level features in the time domain. In this paper, we propose a deep speaker embedding network that adaptively models and fuses multi-part information in the frequency-time domain. It uses a modified ResNet-50 to encode acoustic features into global information, and a proposed multi-part information aggregator that distinguishes global information from the features of different parts and aggregates them, with adaptive weight pooling, into unified utterance-level embedding descriptors. Moreover, we design a privacy-preserving scheme and preliminarily implement it in a prototype system. Experiments are conducted on datasets of three different scales. We demonstrate that the presented multi-part information aggregator with adaptive weight pooling is superior at producing discriminative and robust utterance-level embedding descriptors. We also show that our network achieves state-of-the-art performance by a significant margin on the popular VoxCeleb1 dataset while requiring fewer parameters than previous approaches.
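The abstract does not spell out how the multi-part aggregator works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the encoder output is a feature map of shape (channels, frequency, time), that "parts" are equal bands along the frequency axis, that each part is summarized by mean pooling, and that the adaptive weights come from a softmax over scores produced by a single learned vector. The names `aggregate_parts`, `num_parts`, and `score_vector` are hypothetical.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def aggregate_parts(feature_map, num_parts, score_vector):
    """Fuse one global and several part descriptors into a single embedding.

    feature_map:  array of shape (channels, freq, time), assumed encoder output.
    num_parts:    number of frequency bands to treat as parts (assumption).
    score_vector: array of shape (channels,), stands in for learned parameters.
    """
    # Global descriptor: mean over the whole frequency-time plane.
    global_desc = feature_map.mean(axis=(1, 2))

    # Part descriptors: split the frequency axis into bands, mean-pool each band.
    bands = np.array_split(feature_map, num_parts, axis=1)
    part_descs = [band.mean(axis=(1, 2)) for band in bands]

    # Stack into (num_parts + 1, channels), global descriptor first.
    descs = np.stack([global_desc] + part_descs)

    # Adaptive weights: one score per descriptor, normalized with softmax.
    weights = softmax(descs @ score_vector)

    # Utterance-level embedding: weighted sum of the descriptors.
    embedding = weights @ descs
    return embedding, weights


# Toy usage with random values in place of real encoder features.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 12, 20))
score_vector = rng.standard_normal(8)
embedding, weights = aggregate_parts(feature_map, num_parts=3, score_vector=score_vector)
```

In this sketch the weights depend on the input, so a band that carries more speaker-relevant energy for a given utterance can contribute more to the final embedding than it would under plain average pooling.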