Abstract: We propose a multi-target, signal-to-noise-ratio (SNR)-progressive learning (SNR-PL) framework for regression-based speech enhancement (SE). At low SNR levels, the complicated regression required in SE is often difficult to learn directly. We therefore decompose the original SE problem of mapping noisy to clean speech features, which spans a large SNR gap, into a series of sub-problems, each with a small SNR increment and presumably easier to learn. In our configurations, each hidden layer of the proposed regression neural network is guided to explicitly learn an intermediate target with a specified but small SNR gain. Tested on both deep neural network (DNN) and long short-term memory (LSTM) architectures, SNR-PL consistently outperforms the conventional “black box” DNN framework in both objective measures and model compactness. However, even with the best-configured LSTM-based SNR-PL model, we often observe that performance saturates or even degrades as the number of intermediate targets increases, because useful information is lost in the dimension reduction that comes with additional target layers. To address this information loss, we explore densely connected networks on top of the LSTM structure, in which the input and the preceding intermediate targets are concatenated to learn the next target. Finally, to fully utilize the rich and complementary information in the intermediate targets, we adopt a simple post-processing strategy that further improves performance. On simulated speech data with unseen noise types, the proposed approach consistently outperforms the conventional LSTM approach on objective measures of speech intelligibility and quality. Furthermore, when evaluated on real data provided by the CHiME-4 Challenge for automatic speech recognition (ASR) of noisy microphone array speech, the proposed approach with intermediate outputs directly improves ASR performance, whereas the conventional LSTM approach increases the word error rate.
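
The idea of splitting one large SNR gap into several small steps can be illustrated with a minimal sketch. The abstract does not say how the intermediate targets are constructed, so the sketch assumes they are formed by remixing the same clean and noise signals at progressively higher SNRs. It also works on raw waveform samples for simplicity rather than on speech features. All names (`mix_at_snr`, `progressive_targets`, `measured_snr_db`) and the example values (-5 dB input, 10 dB gain per step, two intermediate targets) are hypothetical and are not taken from the paper.

```python
import math
import random

def power(signal):
    """Mean power of a signal given as a list of floats."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR in dB."""
    # SNR_dB = 10*log10(P_clean / (g^2 * P_noise)), solved for the noise gain g.
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

def progressive_targets(clean, noise, input_snr_db, snr_gain_db, num_intermediate):
    """Return the noisy input and a list of targets ordered by increasing SNR.

    Assumption: intermediate target k is the same mixture at
    input_snr_db + k * snr_gain_db; the final target is the clean signal.
    """
    noisy = mix_at_snr(clean, noise, input_snr_db)
    targets = [
        mix_at_snr(clean, noise, input_snr_db + k * snr_gain_db)
        for k in range(1, num_intermediate + 1)
    ]
    targets.append(list(clean))
    return noisy, targets

def measured_snr_db(mixture, clean):
    """SNR of `mixture` relative to `clean`, treating the residual as noise."""
    residual = [m - c for m, c in zip(mixture, clean)]
    return 10 * math.log10(power(clean) / power(residual))

# Toy data: a sine wave as "speech" and seeded Gaussian noise.
rng = random.Random(0)
clean = [math.sin(0.05 * t) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# -5 dB input, two intermediate targets at +5 dB and +15 dB, then clean speech.
noisy, targets = progressive_targets(clean, noise, -5.0, 10.0, 2)
```

In a progressive network of the kind described above, each entry of `targets` would supervise one target layer, so every layer only has to bridge one `snr_gain_db` step instead of the full gap from the noisy input to clean speech.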

Phase-Aware Speech Enhancement with a Recurrent Two Stage Network

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement

Deep Neural Network Based Noised Asian Speech Enhancement and Its Implementation on a Hearing Aid App

Multiple-target Deep Learning for LSTM-RNN Based Speech Enhancement

Magnitude-and-phase-aware Speech Enhancement with Parallel Sequence Modeling

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

A Multi-Target SNR-Progressive Learning Approach to Regression Based Speech Enhancement

TENET: A Time-reversal Enhancement Network for Noise-robust ASR

Multi-Task Learning U-Net for Single-Channel Speech Enhancement and Mask-Based Voice Activity Detection

Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

Speech Perception Improvement Algorithm Based on a Dual-Path Long Short-Term Memory Network

Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning

Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

An RNN-based Speech Enhancement Method for a Binaural Hearing Aid System

Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

Shared Network for Speech Enhancement Based on Multi-Task Learning

Stage-Wise and Prior-Aware Neural Speech Phase Prediction

Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks