Joint Ideal Ratio Mask and Generative Adversarial Networks for Monaural Speech Enhancement

Jing Yuan, Changchun Bao
DOI: https://doi.org/10.1109/icsp.2018.8652276
2018-01-01
Abstract: Speech enhancement is the task of improving some perceptual aspects of noisy speech. Recently, Generative Adversarial Networks (GANs) have become a popular deep learning method, and different GAN structures have been proposed [1, 2]. In this paper, we propose a new GAN-based framework for the speech enhancement task. We train two models: a generative model G and a discriminative model D. Both G and D are defined by feedforward multilayer perceptrons (MLPs) [3]. The two differ in that the generator G employs a deep neural network (DNN) based on the masking technique, in which the magnitude spectrum of noise and the magnitude spectrum of clean speech are estimated simultaneously from noisy speech features, whereas the discriminator D uses the MLP structure to directly predict the clean speech magnitude spectrum. The model D discriminates whether data comes from clean speech or from speech generated by the G network. Moreover, in our work, the G network is used to perform the speech enhancement. The objective evaluation and experimental results show that the proposed framework significantly improves on the performance of traditional DNN-based and recent GAN-based speech enhancement methods.
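The masking technique mentioned in the abstract can be illustrated with a minimal sketch. It assumes a magnitude-domain ratio mask, speech / (speech + noise), applied to the noisy magnitude spectrum; the abstract does not give the paper's exact mask formulation, so the function names (`ratio_mask`, `apply_mask`), the `eps` constant, and the toy values below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Ratio mask in [0, 1] built from speech and noise magnitude estimates.

    Assumption: magnitude-domain mask; other variants use power spectra
    or a square root. eps avoids division by zero in silent bins.
    """
    return speech_mag / (speech_mag + noise_mag + eps)

def apply_mask(noisy_mag, speech_mag, noise_mag):
    """Scale each time-frequency bin of the noisy magnitude by the mask."""
    return ratio_mask(speech_mag, noise_mag) * noisy_mag

# Toy 2x2 "spectrograms" (hypothetical values, frames x frequency bins).
speech = np.array([[3.0, 0.0], [1.0, 2.0]])
noise = np.array([[1.0, 2.0], [1.0, 0.0]])
# Simplifying assumption: magnitudes add, which ignores phase interaction.
noisy = speech + noise

mask = ratio_mask(speech, noise)
enhanced = apply_mask(noisy, speech, noise)
```

In a system of the kind the abstract describes, the speech and noise magnitudes would be the two simultaneous outputs of the generator network rather than known quantities; here they are given directly so that the effect of the mask can be seen in isolation.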