A Study of Multichannel Spatiotemporal Features and Knowledge Distillation on Robust Target Speaker Extraction

Yichi Wang, Jie Zhang, Shihao Chen, Weitai Zhang, Zhongyi Ye, Xinyuan Zhou, Lirong Dai
DOI: https://doi.org/10.1109/icassp48485.2024.10446870
2024-01-01
Abstract: Target speaker extraction (TSE) based on direction of arrival (DOA) has a wide range of applications, e.g., remote conferencing, hearing aids, and in-car speech interaction. Due to the inherent phase uncertainty, existing TSE methods usually suffer from speaker confusion within specific frequency bands. Imprecise DOA measurements, caused by, e.g., the calibration of the microphone array and ambient noise, can also degrade TSE performance. To improve the robustness of TSE, in this work we propose several new multichannel spatiotemporal features to represent the discriminability of the target speaker. The narrow-band Conformer model is applied in combination with the proposed features to facilitate the extraction of the target speaker. In addition, we consider knowledge distillation to improve model robustness, particularly in the presence of DOA mismatch. Experimental results on a public dataset verify the efficacy of the proposed method.
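The phase uncertainty mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's feature set; it assumes a simple far-field model with two microphones, a speed of sound of 343 m/s, and DOA measured from the array axis. All function names and the 0.1 m spacing are hypothetical choices for illustration. Because phase is only observable modulo 2π, two different directions produce the same inter-microphone phase difference at certain frequencies, which is one way speaker confusion in specific bands can arise.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def target_phase_difference(freq_hz, doa_deg, mic_distance_m):
    """Theoretical inter-microphone phase difference (radians) for a
    far-field source; DOA is measured from the array axis (assumption)."""
    delay = mic_distance_m * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay


def directional_similarity(freq_hz, doa_a_deg, doa_b_deg, mic_distance_m):
    """Cosine of the phase-difference gap between two directions.
    A value of 1.0 means the two DOAs look identical at this frequency."""
    gap = (target_phase_difference(freq_hz, doa_a_deg, mic_distance_m)
           - target_phase_difference(freq_hz, doa_b_deg, mic_distance_m))
    return math.cos(gap)


def first_ambiguous_frequency(doa_a_deg, doa_b_deg, mic_distance_m):
    """Lowest frequency at which the phase gap between two distinct
    directions reaches a full 2*pi wrap."""
    spread = abs(math.cos(math.radians(doa_a_deg))
                 - math.cos(math.radians(doa_b_deg)))
    return SPEED_OF_SOUND / (mic_distance_m * spread)


if __name__ == "__main__":
    d = 0.1  # hypothetical microphone spacing in metres
    f_amb = first_ambiguous_frequency(0.0, 90.0, d)
    print(f"first ambiguous frequency: {f_amb:.1f} Hz")
    for f in (f_amb / 2.0, f_amb):
        sim = directional_similarity(f, 0.0, 90.0, d)
        print(f"{f:7.1f} Hz -> similarity {sim:+.3f}")
```

Under these assumptions, a phase-only cue separates the two directions well at half the ambiguous frequency but cannot separate them at the ambiguous frequency itself, which is why additional discriminative features are attractive.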