Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space.

Chao Peng, Xihong Wu, Tianshu Qu
DOI: https://doi.org/10.21437/interspeech.2020-1781
2020-01-01
Abstract: This paper presents a method for estimating the competing speaker count with deep spectral and spatial embedding fusion. The basic idea is that mixed speech can be projected by neural networks into an embedding space in which embedding vectors are orthogonal for different speakers and parallel for the same speaker. Speaker count estimation can therefore be performed by computing the rank of the mean covariance matrix of the embedding vectors. The method also serves as a feature combination method in the speaker embedding space, rather than simply combining features at the input layer of the neural networks. Experimental results show that the embedding-based method outperforms the classification-based method, in which the network directly predicts the speaker count, and that spatial features help speaker count estimation. In addition, features combined in the embedding space achieve more accurate speaker counting than features combined at the input layer of the neural networks when tested on anechoic and reverberant datasets.
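The rank-based counting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings here are synthetic stand-ins for network outputs, and the function names, the noise level, and the relative eigenvalue threshold used to decide the effective rank are all assumptions made for illustration.

```python
import numpy as np


def estimate_speaker_count(embeddings, rel_threshold=0.1):
    """Estimate the speaker count from an (N, D) array of embedding vectors.

    rel_threshold is a hypothetical cutoff: eigenvalues below this fraction
    of the largest eigenvalue are treated as noise.
    """
    # Mean of the outer products v v^T over all N vectors (a D x D matrix).
    mean_cov = embeddings.T @ embeddings / embeddings.shape[0]
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(mean_cov)[::-1]
    # Effective rank: number of eigenvalues that stand out from the noise floor.
    return int(np.sum(eigvals > rel_threshold * eigvals[0]))


def synthetic_embeddings(n_speakers, n_bins=600, dim=20, noise=0.05, seed=0):
    """Toy embeddings (assumption): one orthonormal direction per speaker plus noise."""
    rng = np.random.default_rng(seed)
    # QR factorization yields n_speakers orthonormal columns of length dim.
    basis, _ = np.linalg.qr(rng.standard_normal((dim, n_speakers)))
    # Assign each bin to a speaker in round-robin order.
    labels = np.arange(n_bins) % n_speakers
    vecs = basis[:, labels].T + noise * rng.standard_normal((n_bins, dim))
    # Normalize every embedding to unit length.
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, estimate_speaker_count(synthetic_embeddings(k)))
```

Because vectors from the same speaker are nearly parallel and vectors from different speakers are nearly orthogonal, each speaker contributes roughly one dominant eigenvalue, so counting the dominant eigenvalues approximates the rank. With real network outputs the threshold would need tuning, since the separation between signal and noise eigenvalues is less clean than in this toy setup.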