Abstract: We propose a multi-target, signal-to-noise-ratio (SNR)-progressive learning (SNR-PL) framework for regression-based speech enhancement (SE). At low SNR levels, the complicated regression required in SE is often hard to learn directly. We therefore decompose the original SE problem of mapping noisy to clean speech features, which spans a large SNR gap, into a series of sub-problems, each with a small SNR increment and presumably easier to learn. In our configurations, each hidden layer of the proposed regression neural network is guided to explicitly learn an intermediate target with a specified but small SNR gain. Tested on both deep neural network (DNN) and long short-term memory (LSTM) architectures, SNR-PL consistently outperforms the conventional “black box” DNN framework in both objective measures and model compactness. With the best-configured LSTM-based SNR-PL model, however, we often observe that performance saturates or even degrades as the number of intermediate targets increases, because useful information is lost in dimension reduction when more target layers are involved. To address this information loss, we explore densely connected networks on top of the LSTM structure, in which the input and the preceding intermediate targets are concatenated to learn the next target. Finally, to fully utilize the rich and complementary information in the intermediate targets, we adopt a simple post-processing strategy that further improves performance. On simulated speech data, experimental results for unseen noise conditions show that the proposed approach consistently outperforms the conventional LSTM approach on objective measures of speech intelligibility and quality. Furthermore, when evaluated on real data from the CHiME-4 Challenge for automatic speech recognition (ASR) of noisy microphone-array speech, the proposed approach with intermediate outputs directly improves ASR performance, whereas the conventional LSTM approach increases the word error rate.
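
The decomposition into small SNR increments can be illustrated with a minimal sketch of how intermediate targets might be constructed. This is an illustration under stated assumptions, not the authors' pipeline: it works on raw time-domain samples rather than the speech features the abstract refers to, the gains of 10 dB and 20 dB are hypothetical values chosen for the example, and the names `snr_db` and `progressive_targets` are introduced here. The idea shown is only that attenuating the additive noise by a factor of 10^(-g/20) yields a target whose SNR is g dB higher than that of the input.

```python
import math
import random


def snr_db(clean, noise):
    """SNR in dB between a clean signal and an additive noise signal."""
    p_clean = sum(x * x for x in clean)
    p_noise = sum(x * x for x in noise)
    return 10.0 * math.log10(p_clean / p_noise)


def progressive_targets(clean, noise, gains_db):
    """Return one (target_mixture, residual_noise) pair per SNR gain.

    Scaling the noise amplitude by 10**(-g/20) raises the SNR by g dB,
    so each pair is a less noisy version of the same utterance.
    """
    stages = []
    for g in gains_db:
        scale = 10.0 ** (-g / 20.0)
        residual = [scale * n for n in noise]
        mixture = [s + r for s, r in zip(clean, residual)]
        stages.append((mixture, residual))
    return stages


# Toy data (assumption): a sinusoid stands in for clean speech,
# white Gaussian noise stands in for the interference.
random.seed(0)
clean = [math.sin(0.05 * t) for t in range(1000)]
noise = [random.gauss(0.0, 1.0) for _ in range(1000)]

input_snr = snr_db(clean, noise)
# Hypothetical gains: two intermediate targets at +10 dB and +20 dB.
stages = progressive_targets(clean, noise, gains_db=[10.0, 20.0])
stage_snrs = [snr_db(clean, residual) for _, residual in stages]
```

In a progressive setup of this kind, each intermediate mixture would serve as the regression target for one guided layer, with the clean signal as the final target.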

Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Multiple-target Deep Learning for LSTM-RNN Based Speech Enhancement

Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

A Speech Enhancement Neural Network Architecture with SNR-Progressive Multi-Target Learning for Robust Speech Recognition

Robust Speech Recognition With Speech Enhanced Deep Neural Networks

Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Time-Domain Speech Enhancement for Robust Automatic Speech Recognition

Noise Robust Speech Recognition on Aurora4 by Humans and Machines

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction

A Multi-Target SNR-Progressive Learning Approach to Regression Based Speech Enhancement

Gated Recurrent Units Based Hybrid Acoustic Models for Robust Speech Recognition

Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks

Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition

Speech enhancement from fused features based on deep neural network and gated recurrent unit network

Single-Channel Speech Enhancement Algorithm Based on ME-MGCRN in Low Signal-to-Noise Scenario

Very Deep Convolutional Neural Networks for Robust Speech Recognition

A New Real-Time Noise Suppression Algorithm for Far-Field Speech Communication Based on Recurrent Neural Network

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition