Abstract: We propose a multi-target, signal-to-noise-ratio (SNR)-progressive learning (SNR-PL) framework for regression-based speech enhancement (SE). At low SNR levels, it is often difficult to learn the complicated regression required in SE directly. We therefore decompose the original SE problem of mapping noisy to clean speech features, which spans a large SNR gap, into a series of sub-problems, each with a small SNR increment and presumably easier to learn. In our configurations, each hidden layer of the proposed regression neural network is guided to explicitly learn an intermediate target with a specified but small SNR gain. Tested on both deep neural network (DNN) and long short-term memory (LSTM) architectures, SNR-PL consistently outperforms the conventional “black box” DNN framework, both in objective measures and in model compactness. With the best-configured LSTM-based SNR-PL model, however, performance often saturates or even degrades as the number of intermediate targets increases, because useful information is lost in dimension reduction as more target layers are involved. To address this information loss, we explore densely connected networks on top of the LSTM structure, in which the input and the preceding intermediate targets are concatenated to learn the next target. Finally, to fully utilize the rich and complementary information in the intermediate targets, a simple post-processing strategy is adopted to further improve performance. On simulated speech data, experimental results under unseen noise conditions demonstrate that the proposed approach consistently performs better than the conventional LSTM approach on objective speech enhancement measures of intelligibility and quality.
Moreover, when evaluated on real data provided by the CHiME-4 Challenge for automatic speech recognition (ASR) of noisy microphone-array speech, the proposed approach with intermediate outputs directly improves ASR performance, whereas the conventional LSTM approach increases the word error rate.
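
The decomposition into small SNR increments can be illustrated with a minimal sketch of how progressive targets might be constructed. This is not the authors' implementation: the helper names (`mix_at_snr`, `progressive_targets`, `measured_snr_db`), the 10 dB step, the -5 dB input SNR, the use of two intermediate targets, and mixing in the waveform domain rather than on speech features are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that clean + scaled noise has the requested SNR (dB)."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


def progressive_targets(clean, noise, input_snr_db, gain_db, num_intermediate):
    """Return the noisy input, the intermediate targets, and the clean target.

    Intermediate target k (1-based) is the same utterance mixed with the same
    noise at input_snr_db + k * gain_db, so each step is a small SNR gain.
    """
    noisy = mix_at_snr(clean, noise, input_snr_db)
    intermediate = [
        mix_at_snr(clean, noise, input_snr_db + gain_db * (k + 1))
        for k in range(num_intermediate)
    ]
    return noisy, intermediate, clean


def measured_snr_db(clean, mixture):
    """SNR of a mixture, treating everything that is not clean speech as noise."""
    residual = mixture - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Stand-in signals (random data, not real speech) just to exercise the code.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)

# Assumed setup: -5 dB input, two intermediate targets, 10 dB gain per step.
noisy, targets, final_target = progressive_targets(clean, noise, -5.0, 10.0, 2)
snrs = [measured_snr_db(clean, x) for x in [noisy] + targets]
```

In this sketch, `snrs` holds the SNR of the input followed by that of each intermediate target, rising in equal steps toward the clean target. In a progressive network, each target layer would be trained to regress to the corresponding entry of `targets`, with the final layer regressing to `final_target`.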

A Multi-Target SNR-Progressive Learning Approach to Regression Based Speech Enhancement.