Abstract: This paper focuses on leveraging deep representation learning (DRL) for speech enhancement (SE). In general, the performance of a deep neural network (DNN) depends heavily on the quality of the data representation it learns. However, the importance of DRL is often overlooked in DNN-based SE algorithms. To obtain higher-quality enhanced speech, we propose a two-stage DRL-based SE method that uses adversarial training. In the first stage, we disentangle different latent variables, because disentangled representations can help a DNN generate better enhanced speech. Specifically, we use the $\beta$-variational autoencoder (VAE) algorithm to obtain the speech and noise posterior estimates and the related representations from the observed signal. However, because the posteriors and representations are intractable and can only be estimated under a conditional assumption, these estimates are not guaranteed to be accurate, which may degrade the accuracy of the final signal estimate. In the second stage, to further improve the quality of the enhanced speech, we introduce adversarial training to reduce the effect of inaccurate posteriors on signal reconstruction and to improve signal estimation accuracy, making the algorithm more robust to potentially inaccurate posterior estimates. The experimental results indicate that the proposed strategy can help similar DNN-based SE algorithms achieve higher short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and scale-invariant signal-to-distortion ratio (SI-SDR) scores. Moreover, the proposed algorithm can also outperform recent competitive SE algorithms.
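
Of the three metrics named in the abstract, SI-SDR is simple enough to state in a few lines. The following is a minimal sketch of the commonly used definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). It is not the evaluation code used in the paper; the function name `si_sdr`, the mean-removal step, and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length 1-D signals.

    Illustrative sketch only: `eps` and the mean removal are assumptions,
    not details taken from the paper.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything in the estimate that is not the scaled target is distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)

    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals: a sinusoid as "clean speech" and a second sinusoid as "noise".
clean = [math.sin(0.1 * n) for n in range(200)]
noisy = [c + 0.1 * math.cos(0.37 * n) for n, c in enumerate(clean)]
noisier = [c + 0.5 * math.cos(0.37 * n) for n, c in enumerate(clean)]
```

Because the target is rescaled to match the estimate, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the property that separates SI-SDR from plain SDR.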


A Two-Stage Deep Representation Learning-Based Speech Enhancement Method Using Variational Autoencoder and Adversarial Training
