Iterative Noisy-Target Approach: Speech Enhancement Without Clean Speech

Yifan Zhang, Wenbin Jiang, Qiongfang Zhuo, Kai Yu
DOI: https://doi.org/10.1007/978-981-97-0601-3_22
2024-01-01
Abstract: Traditional deep neural network (DNN) based speech enhancement usually requires clean speech as the training target. However, limited access to ideal clean speech hinders its practical use. Meanwhile, existing self-supervised or unsupervised methods suffer from both unsatisfactory performance and impractical data requirements (e.g., various kinds of noise added to the same clean speech). Hence, there is a significant need to either relax the restrictions on training data or improve performance. In this paper, we propose a training strategy that requires only noisy speech and noise waveforms. It consists of two phases: 1) in the first round, the DNN is trained on input-target pairs constructed by adding noise to the noisy speech itself, so that the input is noisier speech (noise added to noisy speech) and the target is the noisy speech; 2) in each following round, the model trained in the previous round is used to refine the noisy speech, and new noisier-noisy pairs are constructed for the next round of training. Moreover, to accelerate the process, we apply the iteration at the epoch level. To evaluate the efficiency of the approach, we use a dataset including 10 types of real-world noise and compare against two classic supervised and unsupervised methods.
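The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `mix_at_snr`, `make_pairs`, `iterative_noisy_target`, `train_fn`, and `enhance_fn`, the fixed mixing SNR, and the number of rounds are all hypothetical. It also assumes that, from the second round on, noise is added to the refined speech to form the new input, which the abstract does not state explicitly.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the speech-to-added-noise power ratio is snr_db."""
    # Repeat or truncate the noise so it matches the speech length.
    noise = np.resize(noise, len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise


def make_pairs(targets, noises, snr_db, rng):
    """Build (noisier input, target) pairs by adding a random noise clip to each target."""
    pairs = []
    for target in targets:
        noise = noises[rng.integers(len(noises))]
        pairs.append((mix_at_snr(target, noise, snr_db), target))
    return pairs


def iterative_noisy_target(noisy, noises, train_fn, enhance_fn,
                           rounds=3, snr_db=5.0, seed=0):
    """Hypothetical training loop; `train_fn` and `enhance_fn` are user-supplied.

    train_fn(model, pairs) -> model   trains (or continues training) on the pairs
    enhance_fn(model, waveform) -> waveform   applies the model to one utterance
    """
    rng = np.random.default_rng(seed)
    targets = list(noisy)  # phase 1: the noisy speech itself is the target
    model = None
    for _ in range(rounds):
        pairs = make_pairs(targets, noises, snr_db, rng)
        model = train_fn(model, pairs)
        # Phase 2: refine the noisy speech with the latest model (assumption:
        # the refined speech becomes the target of the next round).
        targets = [enhance_fn(model, x) for x in noisy]
    return model
```

In this sketch each round could correspond to a single epoch, which is one way to read the abstract's remark about applying the iteration at the epoch level.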