Curricular Contrastive Regularization for Speech Enhancement with Self-Supervised Representations.

Xinmeng Xu, Chang Han, Yiqun Zhang, Weiping Tu, Yuhong Yang
DOI: https://doi.org/10.1109/ICASSP48485.2024.10445912
2024-01-01
Abstract: Existing deep learning-based speech enhancement (SE) methods adopt only clean speech as positive samples to guide the training of SE networks, while negative samples, i.e., noisy speech, remain unexploited. In this paper, we adopt contrastive regularization (CR), built upon contrastive learning, to exploit the information in both noisy and clean speech as negative and positive samples, respectively. In particular, CR minimizes the distance between clean and enhanced speech and maximizes the distance between noisy and enhanced speech in the representation space of a self-supervised learning model. However, the contrastive samples are non-consensual, as the negatives are usually represented far from the clean speech, leaving the solution space still under-constricted. To tackle this issue, we assemble the negative samples from (1) the noisy speech and (2) the corresponding enhanced speech obtained without CR, and we customize a curriculum learning strategy that defines the importance of these negative samples, balancing the learning difficulty caused by the different similarities between the embeddings of the positive and negative samples. Experiments show that our proposal improves SE performance effectively without introducing additional computation or parameters.
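The pull/push idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's formulation: the ratio form of the loss, the mean absolute (L1) distance, and the linear curriculum schedule are illustrative choices, and the function names `l1_distance`, `contrastive_regularization`, and `curriculum_weights` are hypothetical. The embeddings are assumed to come from some self-supervised speech model and are represented as plain lists of floats.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def contrastive_regularization(
    enhanced: Vector,
    clean: Vector,
    negatives: Sequence[Vector],
    weights: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Ratio-style contrastive term (assumed form, not taken from the paper).

    The value is small when `enhanced` is close to `clean` (the positive)
    and far from the weighted negatives.
    """
    if len(negatives) != len(weights):
        raise ValueError("need exactly one weight per negative sample")
    pull = l1_distance(enhanced, clean)
    push = sum(w * l1_distance(enhanced, n) for w, n in zip(weights, negatives))
    return pull / (push + eps)


def curriculum_weights(epoch: int, total_epochs: int) -> Tuple[float, float]:
    """Hypothetical linear schedule over two negatives.

    Returns (weight for the noisy speech, weight for the enhanced speech
    obtained without CR). Training starts on the easier negative (noisy
    speech, far from clean) and shifts toward the harder one.
    """
    progress = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return (1.0 - progress, progress)


if __name__ == "__main__":
    clean = [1.0, 1.0]
    noisy = [0.0, 0.0]          # negative 1: noisy input
    baseline = [0.8, 0.8]       # negative 2: enhanced speech without CR
    enhanced = [0.9, 0.9]       # current network output
    weights = curriculum_weights(epoch=5, total_epochs=10)
    print(contrastive_regularization(enhanced, clean, [noisy, baseline], weights))
```

In this sketch the regularizer would be added to the usual reconstruction loss during training only, which is consistent with the abstract's claim that no extra computation or parameters are needed at inference time.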