Study of GANs for Noisy Speech Simulation from Clean Speech

Leander Melroy Maben, Zixun Guo, Chen Chen, Utkarsh Chudiwal, Chng Eng Siong

2023-05-21

Abstract: The performance of speech processing models trained on clean speech drops significantly in noisy conditions. Training with noisy datasets alleviates the problem, but procuring such datasets is not always feasible. Noisy speech simulation models that generate noisy speech from clean speech help remedy this issue. In our work, we study the ability of Generative Adversarial Networks (GANs) to simulate a variety of noises. Noise types from the Ultra-High-Frequency/Very-High-Frequency (UHF/VHF), additive stationary and non-stationary, and codec distortion categories are studied. We propose four GANs, including the non-parallel translators, SpeechAttentionGAN, SimuGAN, and MaskCycleGAN-Augment, and the parallel translator, Speech2Speech-Augment. We achieved improvements of 55.8%, 28.9%, and 22.8% in terms of Multi-Scale Spectral Loss (MSSL) as compared to the baseline for the RATS, TIMIT-Cabin, and TIMIT-Helicopter datasets, respectively, after training on small datasets of about 3 minutes.
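The results above are stated as relative improvements in Multi-Scale Spectral Loss. The paper's exact MSSL configuration is not given here, so the following is only a minimal sketch assuming a common DDSP-style formulation: for several FFT sizes, add the mean absolute difference between the magnitude spectrograms of the reference and the estimate to the mean absolute difference between their logarithms. The FFT sizes, Hann window, 75% overlap, `alpha` weight, `eps`, and all function names are illustrative assumptions. The helper `relative_improvement` shows how a figure such as 55.8% would be derived from a baseline loss and a proposed loss (lower loss is better).

```python
import numpy as np


def magnitude_stft(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude STFT (frames x frequency bins), Hann window, no padding."""
    if len(x) < n_fft:
        raise ValueError(f"signal of length {len(x)} is shorter than n_fft={n_fft}")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))


def multi_scale_spectral_loss(
    reference: np.ndarray,
    estimate: np.ndarray,
    fft_sizes=(2048, 1024, 512, 256, 128, 64),  # assumed scales
    alpha: float = 1.0,  # assumed weight of the log-magnitude term
    eps: float = 1e-7,  # avoids log(0)
) -> float:
    """Assumed DDSP-style MSSL: linear + log magnitude L1 terms summed over scales."""
    if reference.shape != estimate.shape:
        raise ValueError("reference and estimate must have the same shape")
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4  # 75% overlap (assumption)
        s_ref = magnitude_stft(reference, n_fft, hop)
        s_est = magnitude_stft(estimate, n_fft, hop)
        linear_term = np.mean(np.abs(s_ref - s_est))
        log_term = np.mean(np.abs(np.log(s_ref + eps) - np.log(s_est + eps)))
        total += linear_term + alpha * log_term
    return float(total)


def relative_improvement(baseline_loss: float, proposed_loss: float) -> float:
    """Percentage reduction of a loss relative to a baseline."""
    return 100.0 * (baseline_loss - proposed_loss) / baseline_loss
```

In this setting, `reference` would be real noisy speech and `estimate` the simulated noisy speech for the same utterance, both 1-D float arrays of equal length at the same sample rate.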

Sound; Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the drop in performance that speech processing models suffer in noisy environments. Because collecting large real noisy datasets is not always feasible, the researchers investigate whether Generative Adversarial Networks (GANs) can instead generate noisy speech from clean speech. The noise types studied are Ultra-High-Frequency/Very-High-Frequency (UHF/VHF) noise, additive stationary and non-stationary noise, and codec distortion. The study proposes four GAN models: three non-parallel translators (SpeechAttentionGAN, SimuGAN, and MaskCycleGAN-Augment), which learn from unpaired clean and noisy recordings, and one parallel translator (Speech2Speech-Augment), which learns from paired recordings. After training on small datasets of about 3 minutes, the study reports improvements in Multi-Scale Spectral Loss (MSSL) over the baseline of 55.8%, 28.9%, and 22.8% on the RATS, TIMIT-Cabin, and TIMIT-Helicopter datasets, respectively. These results suggest that GANs can simulate noisy speech under several types of noise, and that such simulated data could help make downstream tasks such as speech recognition more robust.
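To illustrate the difference between the two training settings, the sketch below assumes that the non-parallel models rely on a CycleGAN-style cycle-consistency term, as the CycleGAN family of models does, while a parallel model can compare its output directly with the paired noisy target. The generator callables and function names are hypothetical stand-ins, and the adversarial losses that a real GAN objective also includes are omitted.

```python
from typing import Callable

import numpy as np

# A generator maps a batch of speech features to a batch of translated features.
Generator = Callable[[np.ndarray], np.ndarray]


def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(a - b)))


def cycle_consistency_loss(
    clean_batch: np.ndarray,
    noisy_batch: np.ndarray,
    g_clean_to_noisy: Generator,
    f_noisy_to_clean: Generator,
) -> float:
    """Non-parallel setting: clean_batch and noisy_batch need not correspond.

    Each domain is translated to the other domain and back, and the result
    is compared with the original input.
    """
    clean_cycle = f_noisy_to_clean(g_clean_to_noisy(clean_batch))
    noisy_cycle = g_clean_to_noisy(f_noisy_to_clean(noisy_batch))
    return l1(clean_cycle, clean_batch) + l1(noisy_cycle, noisy_batch)


def paired_reconstruction_loss(
    clean_batch: np.ndarray,
    noisy_target: np.ndarray,
    g_clean_to_noisy: Generator,
) -> float:
    """Parallel setting: noisy_target[i] is the recording paired with clean_batch[i]."""
    return l1(g_clean_to_noisy(clean_batch), noisy_target)
```

The cycle term only asks that the two generators invert each other. It is this weaker constraint that lets non-parallel models train without time-aligned clean/noisy pairs.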
