Study of GANs for Noisy Speech Simulation from Clean Speech

Leander Melroy Maben, Zixun Guo, Chen Chen, Utkarsh Chudiwal, Chng Eng Siong
2023-05-21
Abstract: The performance of speech processing models trained on clean speech drops significantly in noisy conditions. Training with noisy datasets alleviates the problem, but procuring such datasets is not always feasible. Noisy speech simulation models that generate noisy speech from clean speech help remedy this issue. In our work, we study the ability of Generative Adversarial Networks (GANs) to simulate a variety of noises. We study noise from the Ultra-High-Frequency/Very-High-Frequency (UHF/VHF), additive stationary and non-stationary, and codec distortion categories. We propose four GANs: the non-parallel translators SpeechAttentionGAN, SimuGAN, and MaskCycleGAN-Augment, and the parallel translator Speech2Speech-Augment. We achieved improvements of 55.8%, 28.9%, and 22.8% in terms of Multi-Scale Spectral Loss (MSSL) as compared to the baseline for the RATS, TIMIT-Cabin, and TIMIT-Helicopter datasets, respectively, after training on small datasets of about 3 minutes.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the performance degradation of speech processing models in noisy environments. Specifically, the researchers explore the use of Generative Adversarial Networks (GANs) to generate noisy speech from clean speech. Because large, real noisy datasets are not always obtainable, the researchers use GANs to simulate several noise categories: Ultra-High-Frequency/Very-High-Frequency (UHF/VHF) channel noise, additive stationary and non-stationary noise, and codec distortion. The study proposes four GAN models: the non-parallel translators SpeechAttentionGAN, SimuGAN, and MaskCycleGAN-Augment, and the parallel translator Speech2Speech-Augment. After training on small datasets of about 3 minutes, the models improve the Multi-Scale Spectral Loss (MSSL) relative to the baseline by 55.8%, 28.9%, and 22.8% on the RATS, TIMIT-Cabin, and TIMIT-Helicopter datasets, respectively. This suggests that GANs can simulate noisy speech under different noise conditions from very little data, which could in turn help train downstream models, such as speech recognizers, that are more robust to noise.
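
To make the MSSL figures above more concrete, here is a minimal sketch of one common multi-scale spectral loss formulation: the L1 distance between magnitude spectrograms plus the L1 distance between their logs, summed over several FFT sizes. The summary does not give the paper's exact definition, so the FFT sizes, hop length, Hann window, and the function names `stft_magnitude`, `multi_scale_spectral_loss`, and `relative_improvement` are all assumptions made for illustration. The sketch also assumes that a lower MSSL is better, so an "improvement" is read as a relative reduction against the baseline.

```python
import numpy as np


def stft_magnitude(signal, fft_size, hop):
    """Magnitude spectrogram with a Hann window, shape (n_frames, fft_size // 2 + 1)."""
    window = np.hanning(fft_size)
    n_frames = 1 + (len(signal) - fft_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + fft_size] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_scale_spectral_loss(x, y, fft_sizes=(2048, 1024, 512, 256, 128, 64), eps=1e-7):
    """Sum of linear and log-magnitude L1 distances over several FFT sizes.

    Assumed formulation (not taken from the paper); both signals must have
    the same length, at least as long as the largest FFT size.
    """
    total = 0.0
    for n in fft_sizes:
        hop = n // 4  # assumed 75% overlap between frames
        sx = stft_magnitude(x, n, hop)
        sy = stft_magnitude(y, n, hop)
        linear_term = np.mean(np.abs(sx - sy))
        log_term = np.mean(np.abs(np.log(sx + eps) - np.log(sy + eps)))
        total += linear_term + log_term
    return total


def relative_improvement(baseline_loss, proposed_loss):
    """Percentage reduction of the loss relative to the baseline (lower loss is better)."""
    return 100.0 * (baseline_loss - proposed_loss) / baseline_loss
```

Under these assumptions, a simulated noisy utterance would be compared against the real noisy recording of the same utterance with `multi_scale_spectral_loss`, and `relative_improvement` would turn a baseline score and a GAN score into a percentage of the kind reported above.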