Perceptual Loss Function for Speech Enhancement Based on Generative Adversarial Learning

Xin Bai, Xueliang Zhang, Hui Zhang, Haifeng Huang
DOI: https://doi.org/10.23919/apsipaasc55919.2022.9980170
2022-01-01
Abstract: The loss function plays an important role in current deep learning-based speech enhancement. The most commonly used loss is the mean squared error (MSE) between the enhanced speech and the target speech, but it does not accurately reflect speech quality after noise reduction. Perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) are the metrics usually employed to evaluate speech enhancement algorithms. This paper proposes a generative adversarial architecture that can directly use PESQ and STOI as training targets for deep learning-based speech enhancement. Specifically, the enhancement network and the assessment network serve as the generator and the discriminator, respectively, of a generative adversarial network (GAN). In each epoch of adversarial training, the generator is first fixed and the discriminator is trained to estimate the true PESQ and STOI of the enhanced speech. The discriminator is then fixed, and the generator is trained to bring the PESQ and STOI of the enhanced speech close to those of the target speech. The assessment network is based on Multi-gate Mixture-of-Experts (MMoE), which is suitable for multi-task learning. Compared with using the clean speech spectrum as the training target, the proposed method is more effective for speech enhancement.
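The two-score assessment network and the two alternating objectives described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it follows only the general MMoE layout (shared experts, one softmax gate and one output tower per task), and the layer sizes, ReLU experts, sigmoid outputs, squared-error losses, scores normalized to the range 0 to 1, and all function names (`init_mmoe`, `mmoe_forward`, `discriminator_loss`, `generator_loss`) are assumptions made for the example.

```python
import math
import random

def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relu(vec):
    return [max(0.0, v) for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def init_mmoe(in_dim, expert_dim, num_experts, num_tasks, seed=0):
    """Random parameters for a tiny MMoE (sizes are illustrative assumptions)."""
    rng = random.Random(seed)
    return {
        "experts": [init_matrix(expert_dim, in_dim, rng) for _ in range(num_experts)],
        "gates": [init_matrix(num_experts, in_dim, rng) for _ in range(num_tasks)],
        "towers": [init_matrix(1, expert_dim, rng) for _ in range(num_tasks)],
    }

def mmoe_forward(params, features):
    """Return (scores, gate_weights) with one score in (0, 1) per task."""
    # Experts are shared by all tasks.
    expert_outs = [relu(linear(w, features)) for w in params["experts"]]
    scores, gate_weights = [], []
    for gate, tower in zip(params["gates"], params["towers"]):
        mix = softmax(linear(gate, features))  # task-specific weights over experts
        fused = [sum(m * out[j] for m, out in zip(mix, expert_outs))
                 for j in range(len(expert_outs[0]))]
        scores.append(sigmoid(linear(tower, fused)[0]))  # task-specific tower
        gate_weights.append(mix)
    return scores, gate_weights

def discriminator_loss(predicted, measured):
    """Generator fixed: pull predicted scores toward the measured metric values."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured))

def generator_loss(predicted, target=(1.0, 1.0)):
    """Discriminator fixed: pull predicted scores toward the target-speech scores.

    The default target assumes both metrics are normalized so that clean
    speech scores 1.0 on each task.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# Two tasks stand for the PESQ and STOI estimates; the 4-dimensional input is a
# hypothetical feature vector summarizing one enhanced utterance.
params = init_mmoe(in_dim=4, expert_dim=3, num_experts=3, num_tasks=2)
scores, gates = mmoe_forward(params, [0.2, -0.1, 0.4, 0.3])
d_loss = discriminator_loss(scores, measured=[0.6, 0.8])  # made-up metric values
g_loss = generator_loss(scores)
```

In a full training loop, each epoch would first minimize `discriminator_loss` with respect to the assessment network's parameters only, then minimize `generator_loss` with respect to the enhancement network's parameters only, with gradients flowing through the frozen assessment network. The sketch omits the enhancement network and the gradient updates.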