Speaker Change Detection with Weighted-sum Knowledge Distillation Based on Self-supervised Pre-trained Models

Hang Su, Yuxiang Kong, Lichun Fan, Peng Gao, Yujun Wang, Zhiyong Wu
DOI: https://doi.org/10.21437/interspeech.2024-1885
2024-01-01
Abstract: Speaker Change Detection (SCD) is an essential problem in speech processing with applications in many fields. Self-supervised models have shown impressive performance on many downstream tasks under the pre-training and fine-tuning paradigm. However, applying a fine-tuned self-supervised pre-trained model to the frame-level SCD task in industry is limited in practice, because deployment typically requires a smaller model that consumes fewer computational resources. To tackle this issue, we propose using Knowledge Distillation (KD) to leverage the capabilities of the self-supervised model. First, we propose a basic KD method based on the pre-trained model. Then, we propose a weighted-sum KD method that selectively extracts information from the pre-trained model. Experimental results demonstrate the effectiveness of the basic KD method and show a further improvement with the weighted-sum KD method. The proposed method is more suitable for industrial applications than fine-tuning.
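The abstract does not spell out how the weighted-sum KD target is built, so the following is only a minimal sketch of one plausible reading: the teacher's per-layer outputs are combined with softmax-normalized layer weights, and the student is trained to match that combination with a mean-squared-error loss. The function names, the tensor shapes, the softmax normalization, and the MSE loss are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of layer logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_sum_teacher(layer_outputs, layer_logits):
    # layer_outputs: (num_layers, num_frames, feat_dim) teacher features (hypothetical shape).
    # layer_logits:  (num_layers,) scores, learnable in a real system (assumption).
    weights = softmax(layer_logits)
    # Contract the layer axis: result has shape (num_frames, feat_dim).
    return np.tensordot(weights, layer_outputs, axes=1)

def distillation_loss(student_feats, teacher_target):
    # Frame-level MSE between student features and the weighted teacher target (assumption).
    return float(np.mean((student_feats - teacher_target) ** 2))

rng = np.random.default_rng(0)
layers = rng.normal(size=(4, 10, 8))   # 4 teacher layers, 10 frames, 8-dim features
logits = np.zeros(4)                   # equal logits give uniform layer weights
target = weighted_sum_teacher(layers, logits)
student = rng.normal(size=(10, 8))     # stand-in for the small student's output
loss = distillation_loss(student, target)
```

With equal logits the target reduces to a plain average over teacher layers; learning the logits would let training emphasize the layers most useful for SCD, which is one way to interpret "selectively extract information" in the abstract.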