Abstract: Deep learning-based speech enhancement approaches such as deep neural networks (DNN) and Long Short-Term Memory (LSTM) networks have already demonstrated superior results to classical methods. However, these methods do not take full advantage of temporal context information: while DNN and LSTM consider temporal context in the noisy source speech, they do not do so for the estimated clean speech. Both DNN and LSTM also tend to over-smooth spectra, which causes the enhanced speech to sound muffled. This paper proposes a novel architecture, which we term a conditional generative model (CGM), to address both issues. By applying an adversarial training scheme to a generator of deep dilated convolutional layers, CGM is designed to model the joint and symmetric conditions of both noisy and estimated clean spectra. We evaluate CGM against both DNN and LSTM in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism lead to better PESQ and STOI in all tested noise conditions. In addition to yielding significant improvements in PESQ and STOI, CGM and adversarial training both mitigate over-smoothing.
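As a rough illustration of why a stack of dilated convolutional layers can cover a long temporal context, the following minimal sketch computes the receptive field of such a stack and checks it by pushing a unit impulse through the layers. The kernel size, the doubling dilation schedule, the causal (look-back only) padding, and the function names are all assumptions made for this example; the abstract does not specify the configuration of the CGM generator.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Causal 1-D dilated convolution with implicit zero padding on the left.

    Causality is an assumption of this sketch, not a detail from the abstract.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t - i * dilation  # look back i * dilation frames
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input frames that can influence one output frame of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configuration: kernel size 3, dilation doubling per layer.
dilations = [1, 2, 4, 8]
rf = receptive_field(3, dilations)

# Push a unit impulse through the stack to see how far its influence spreads.
signal = [1.0] + [0.0] * 39
for d in dilations:
    signal = dilated_conv1d(signal, [1.0, 1.0, 1.0], d)
span = sum(1 for v in signal if v != 0.0)

print(rf, span)
```

With these assumed settings, four layers already span a few dozen frames, whereas four undilated layers of the same kernel size would span only nine. This exponential growth is the usual motivation for dilated stacks when long temporal context is needed.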
