Improving Speech Separation with Knowledge Distilled from Self-supervised Pre-trained Models

Bowen Qu, Chenda Li, Jinfeng Bai, Yanmin Qian
DOI: https://doi.org/10.1109/iscslp57327.2022.10038203
2022-01-01
Abstract: Large-scale self-supervised learning (SSL) models have shown outstanding ability in many speech processing tasks. Most SSL models in the literature are trained on datasets dominated by single-talker utterances, so applying them directly to speech separation may not be optimal. In addition, the high computational cost of large-scale SSL models increases the overall complexity of a speech separation system. In this paper, we explore the application of pre-trained SSL models to the speech separation task. Instead of using the SSL model directly, we design an SSL feature predictor that estimates single-talker deep features from the speech mixture. The SSL feature predictor is trained with knowledge distilled from the pre-trained Wav2Vec2.0 model. Our experiments show that the performance of time-domain speech separation can be clearly improved by leveraging the SSL feature predictor.
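The distillation idea in the abstract can be illustrated with a toy sketch: a frozen teacher computes features from clean single-talker speech, and a student predictor sees only the mixture and is trained to reproduce those features. Everything below is an assumption made for illustration, since the abstract does not specify the predictor architecture or the loss: the teacher is a fixed random projection standing in for Wav2Vec2.0, the student is a plain linear map, the loss is mean squared error, and the names `TEACHER`, `project`, `mse`, and `distill` are hypothetical.

```python
import random

rng = random.Random(0)
FRAME, DIM, N_FRAMES = 4, 3, 50  # toy sizes, chosen only for illustration

# Frozen "teacher": a fixed random projection standing in for a pre-trained
# SSL model such as Wav2Vec2.0 (hypothetical stand-in, not the real model).
TEACHER = [[rng.uniform(-1, 1) for _ in range(FRAME)] for _ in range(DIM)]

def project(weights, frame):
    """Apply a linear map (DIM x FRAME) to one frame of samples."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

def mse(pred, target):
    """Mean squared error over all frames and feature dimensions."""
    total = sum((p - t) ** 2
                for pf, tf in zip(pred, target)
                for p, t in zip(pf, tf))
    return total / (len(pred) * len(pred[0]))

def distill(mixture, teacher_feats, steps=200, lr=0.1):
    """Fit a linear student on the mixture to match the teacher's features."""
    student = [[0.0] * FRAME for _ in range(DIM)]
    history = []  # loss measured before each gradient step
    scale = 2.0 / (len(mixture) * DIM)
    for _ in range(steps):
        pred = [project(student, f) for f in mixture]
        history.append(mse(pred, teacher_feats))
        # Full-batch gradient descent on the MSE distillation loss.
        for d in range(DIM):
            for k in range(FRAME):
                grad = scale * sum((p[d] - t[d]) * f[k]
                                   for p, t, f in zip(pred, teacher_feats, mixture))
                student[d][k] -= lr * grad
    return student, history

def random_frames():
    """Synthetic 'speech' frames with samples in [-0.5, 0.5]."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(FRAME)]
            for _ in range(N_FRAMES)]

talker_a, talker_b = random_frames(), random_frames()
mixture = [[a + b for a, b in zip(fa, fb)] for fa, fb in zip(talker_a, talker_b)]
target_a = [project(TEACHER, f) for f in talker_a]  # teacher sees clean speech only
student_a, losses = distill(mixture, target_a)
print(f"distillation loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this sketch the teacher is needed only to produce training targets. At inference time only the lightweight student runs on the mixture, which is the motivation the abstract gives for distilling rather than using the SSL model directly.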