Abstract: We propose a multi-target, signal-to-noise-ratio (SNR)-progressive learning (SNR-PL) framework for regression-based speech enhancement (SE). At low SNR levels, it is often difficult to directly learn the complicated regression required in SE. We therefore decompose the original SE problem of mapping noisy to clean speech features, which spans a large SNR gap, into a series of sub-problems, each with a small SNR increment and presumably easier to learn. In our configurations, each hidden layer of the proposed regression neural network is guided to explicitly learn an intermediate target with a specified but small SNR gain. Tested on both deep neural network (DNN) and long short-term memory (LSTM) architectures, SNR-PL consistently outperforms the conventional “black box” DNN framework in both objective measures and model compactness. However, even with the best-configured LSTM-based SNR-PL model, we often observe that performance saturates or even degrades as the number of intermediate targets increases, because useful information is lost in dimension reduction when more target layers are involved. To address this information loss, we explore densely connected networks on top of the LSTM structure, in which the input and the preceding intermediate targets are concatenated to learn the next target. Finally, to fully utilize the rich and complementary information in the intermediate targets, a simple post-processing strategy is adopted to further improve performance. Evaluated on simulated speech data, the proposed approach consistently performs better than the conventional LSTM approach in unseen-noise cases in terms of objective measures of speech intelligibility and quality. Furthermore, on real data provided by the CHiME-4 Challenge for automatic speech recognition (ASR) of noisy microphone array speech, the proposed approach with intermediate outputs directly improves ASR performance, whereas the conventional LSTM approach increases the word error rate.
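The decomposition into small SNR increments can be illustrated with a minimal sketch of how intermediate targets might be constructed. This is not the authors' implementation: it assumes time-domain samples rather than the speech features used in the paper, and the function names, the 10 dB step, and the number of intermediate targets are all hypothetical choices made for illustration.

```python
import math


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def noise_gain_for_snr(clean, noise, target_snr_db):
    """Gain g such that mixing clean + g * noise yields the target SNR in dB."""
    return math.sqrt(power(clean) / (power(noise) * 10 ** (target_snr_db / 10)))


def progressive_targets(clean, noise, input_snr_db, snr_step_db, num_intermediate):
    """Build (snr_db, signal) pairs: intermediate targets, then the clean target.

    Hypothetical helper: each intermediate target is the same clean/noise pair
    remixed at an SNR that is snr_step_db higher than the previous one.
    """
    targets = []
    for k in range(1, num_intermediate + 1):
        target_snr = input_snr_db + k * snr_step_db
        g = noise_gain_for_snr(clean, noise, target_snr)
        targets.append((target_snr, [c + g * n for c, n in zip(clean, noise)]))
    # The final target is the clean signal itself (infinite SNR).
    targets.append((math.inf, list(clean)))
    return targets


def measured_snr_db(clean, mixture):
    """SNR of a mixture relative to its known clean component."""
    residual = [m - c for m, c in zip(mixture, clean)]
    return 10 * math.log10(power(clean) / power(residual))
```

For a noisy input at -5 dB, calling `progressive_targets(clean, noise, -5.0, 10.0, 2)` would give targets at 5 dB and 15 dB followed by the clean signal; in a progressive network, one layer would be trained against each of these in order.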

A Multi-Target SNR-Progressive Learning Approach to Regression Based Speech Enhancement.
