Abstract: This paper focuses on leveraging deep representation learning (DRL) for speech enhancement (SE). In general, the performance of a deep neural network (DNN) depends heavily on how well it learns data representations. However, the importance of DRL is often ignored in DNN-based SE algorithms. To obtain higher-quality enhanced speech, we propose a two-stage DRL-based SE method that uses adversarial training. In the first stage, we disentangle different latent variables, because disentangled representations can help a DNN generate better enhanced speech. Specifically, we use the $\beta$-variational autoencoder (VAE) algorithm to obtain the speech and noise posterior estimations and the related representations from the observed signal. However, since the posteriors and representations are intractable and can only be estimated under a conditional assumption, it is difficult to ensure that these estimations are always accurate, which may degrade the final accuracy of the signal estimation. To further improve the quality of the enhanced speech, in the second stage we introduce adversarial training to reduce the effect of inaccurate posteriors on signal reconstruction and to improve the signal estimation accuracy, making our algorithm more robust to potentially inaccurate posterior estimations. As a result, better SE performance can be achieved. The experimental results indicate that the proposed strategy can help similar DNN-based SE algorithms achieve higher short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and scale-invariant signal-to-distortion ratio (SI-SDR) scores. Moreover, the proposed algorithm outperforms recent competitive SE algorithms.
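
Of the three metrics named in the abstract, SI-SDR has a short closed-form definition, so a minimal sketch may help make it concrete. The code below follows the commonly used definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The function name `si_sdr`, the mean-removal step, and the small `eps` constant are illustrative assumptions and are not taken from the paper, whose evaluation code may differ.

```python
import math

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length 1-D signals.

    Illustrative sketch only; mean removal and eps are assumptions.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    est = [x - mean_e for x in estimate]
    ref = [x - mean_r for x in reference]
    # Scale factor that projects the estimate onto the reference.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    # Target component (scaled reference) and residual distortion.
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))

# Synthetic example: a sine "speech" signal plus a small cosine "noise".
clean = [math.sin(0.1 * n) for n in range(200)]
noise = [0.1 * math.cos(0.37 * n) for n in range(200)]
mild = [c + v for c, v in zip(clean, noise)]
heavy = [c + 3.0 * v for c, v in zip(clean, noise)]
rescaled = [2.0 * x for x in mild]
```

Under these assumptions, `si_sdr(mild, clean)` should exceed `si_sdr(heavy, clean)`, and `si_sdr(rescaled, clean)` should match `si_sdr(mild, clean)` up to rounding, which is the scale invariance the metric's name refers to.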

Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition

VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention

Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

TENET: A Time-reversal Enhancement Network for Noise-robust ASR

A Deep Representation Learning-based Speech Enhancement Method Using Complex Convolution Recurrent Variational Autoencoder

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement

Cross-domain Single-channel Speech Enhancement Model with Bi-projection Fusion Module for Noise-robust ASR

Unseen Noise Estimation Using Separable Deep Auto Encoder for Speech Enhancement

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

Noise Adaptive Speech Enhancement Using Domain Adversarial Training

Toward Universal Speech Enhancement for Diverse Input Conditions

A Two-Stage Deep Representation Learning-Based Speech Enhancement Method Using Variational Autoencoder and Adversarial Training

Time domain speech enhancement with CNN and time-attention transformer

High Fidelity Speech Enhancement with Band-split RNN

Self-Supervised Learning for Speech Enhancement through Synthesis