Learning Multi-dimensional Speaker Localization: Axis Partitioning, Unbiased Label Distribution, and Data Augmentation

Linfeng Feng, Yijun Gong, Zhi Liu, Xiao-Lei Zhang, Xuelong Li
DOI: https://doi.org/10.1109/taslp.2024.3426309
2024-01-01
Abstract: Multi-dimensional speaker localization (SL) aims to estimate the two- or three-dimensional locations of speakers. A recent advancement in multi-dimensional SL is the use of end-to-end deep neural networks (DNNs) with ad-hoc microphone arrays. This method transforms the SL problem into a classification problem, i.e., the problem of identifying the grids in which speakers are located. However, the classification formulation has two closely connected weaknesses. First, it introduces quantization error, which can be mitigated only by using a large number of grids. Second, increasing the number of grids leads to the curse of dimensionality. To address these problems, we propose an efficient multi-dimensional SL algorithm with three novel contributions. First, we decouple the high-dimensional grid partitioning into axis partitioning, which substantially mitigates the curse of dimensionality. In particular, for the multi-speaker localization problem, we employ a separator to circumvent the permutation ambiguity of the axis partitioning in the inference stage. Second, we introduce a comprehensive unbiased label distribution scheme to further eliminate quantization errors. Finally, we propose a set of data augmentation techniques, including coordinate transformation, stochastic node selection, and mixed training, to alleviate overfitting and sample imbalance problems. The proposed methods were evaluated on both simulated and real-world data, and the experimental results confirm their effectiveness.
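The two central ideas of the abstract can be illustrated with a small sketch. It compares the number of output classes under grid partitioning and axis partitioning, and builds a soft label along one axis whose expected value equals the true coordinate. The abstract does not specify the label distribution scheme, so the two-point linear interpolation used here is only an assumed stand-in, and all function names, the room length, and the number of anchor points are hypothetical.

```python
import numpy as np


def grid_classes(points_per_axis: int, dims: int) -> int:
    """Classes needed when the whole space is cut into cells (grows exponentially)."""
    return points_per_axis ** dims


def axis_classes(points_per_axis: int, dims: int) -> int:
    """Classes needed when each axis is classified separately (grows linearly)."""
    return points_per_axis * dims


def anchor_points(length: float, points: int) -> np.ndarray:
    """Evenly spaced anchor positions along one axis, including both walls."""
    return np.linspace(0.0, length, points)


def soft_axis_label(coord: float, length: float, points: int) -> np.ndarray:
    """Assumed scheme, not taken from the paper: split the probability mass
    between the two anchors surrounding coord so that the expected value of
    the label equals coord exactly."""
    anchors = anchor_points(length, points)
    coord = float(np.clip(coord, 0.0, length))
    label = np.zeros(points)
    hi = int(np.searchsorted(anchors, coord))  # first anchor >= coord
    if hi == 0:
        label[0] = 1.0  # coord sits on the first anchor
        return label
    lo = hi - 1
    w = (coord - anchors[lo]) / (anchors[hi] - anchors[lo])
    label[lo] = 1.0 - w
    label[hi] = w
    return label


def decode_expectation(label: np.ndarray, length: float) -> float:
    """Recover a coordinate as the expected anchor position under the label."""
    return float(label @ anchor_points(length, label.size))


def decode_argmax(label: np.ndarray, length: float) -> float:
    """Recover a coordinate as the single most likely anchor (hard decision)."""
    return float(anchor_points(length, label.size)[int(np.argmax(label))])
```

With 50 anchor points per axis in three dimensions, the grid formulation needs 50^3 = 125000 classes, while the axis formulation needs 3 x 50 = 150. In this sketch, decoding a soft label by its expectation returns the original coordinate, whereas a hard arg-max decision snaps to the nearest anchor and retains the quantization error.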