Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning

Shuai Wang, Yexin Yang, Yanmin Qian, Kai Yu
DOI: https://doi.org/10.1109/iscslp49672.2021.9362097
2021-01-01
Abstract: The pooling function plays a vital role in the segment-level deep speaker embedding learning framework. One common approach is to compute statistics of the temporal features; two typical examples are temporal average pooling (TAP), which uses the mean only, and temporal statistics pooling (TSTP), which combines the mean and the standard deviation. Empirically, researchers observe a large performance degradation in x-vector systems when the standard deviation is removed. Based on this observation, we design a set of experiments to quantitatively analyze the effectiveness of different statistics, including an investigation and comparison of pooling functions based on standard deviation, covariance, and ℓp-norm. Experiments are carried out on VoxCeleb and SRE16, and the results show that pooling functions based on second-order statistics yield better performance than TAP, and that the simple standard deviation alone achieves the best performance on all evaluation conditions.
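To make the pooling functions named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the frame-level features of one utterance are given as an array of shape (T, D). The function names, the `eps` stabilizer, and the use of the population variance are illustrative assumptions; the abstract does not specify these details. Covariance-based and ℓp-norm pooling are not sketched here.

```python
import numpy as np


def temporal_average_pooling(frames):
    """TAP: per-dimension mean over time. frames has shape (T, D)."""
    return frames.mean(axis=0)


def std_pooling(frames, eps=1e-8):
    """Standard-deviation-only pooling over time.

    eps is an assumed numerical stabilizer; the population variance
    (ddof=0) is also an assumption, not taken from the paper.
    """
    return np.sqrt(frames.var(axis=0) + eps)


def temporal_statistics_pooling(frames, eps=1e-8):
    """TSTP: concatenation of per-dimension mean and standard deviation."""
    return np.concatenate(
        [temporal_average_pooling(frames), std_pooling(frames, eps)]
    )


# Hypothetical input: 200 frames of 4-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 4))

tap = temporal_average_pooling(frames)      # shape (4,)
std = std_pooling(frames)                   # shape (4,)
tstp = temporal_statistics_pooling(frames)  # shape (8,)
```

Each function maps a variable-length sequence of frames to a fixed-size segment-level vector: TAP and standard-deviation pooling keep the feature dimension D, while TSTP doubles it to 2D.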