Abstract: We propose a multi-target, signal-to-noise-ratio (SNR)-progressive learning (SNR-PL) framework for regression-based speech enhancement (SE). At low SNR levels, it is often difficult to learn the complicated regression required in SE directly. We therefore decompose the original SE problem of mapping noisy to clean speech features, which spans a large SNR gap, into a series of sub-problems, each with a small SNR increment and presumably easier to learn. In our configurations, each hidden layer of the proposed regression neural network is guided to explicitly learn an intermediate target with a specified but small SNR gain. Tested on both deep neural network (DNN) and long short-term memory (LSTM) architectures, SNR-PL consistently outperforms the conventional “black box” DNN framework, yielding better objective measures with a more compact network model. Furthermore, with the best-configured LSTM-based SNR-PL model, we often observe that performance readily saturates or even degrades as the number of intermediate targets increases, because useful information is lost in dimension reduction when more target layers are involved. To address this information loss, we explore densely connected networks on top of the LSTM structure, in which the input and the preceding intermediate targets are concatenated to learn the next target. Finally, to fully utilize the rich and complementary information in the intermediate targets, a simple post-processing strategy is adopted to further improve performance. Evaluated on simulated speech data, experimental results in unseen-noise cases demonstrate that the proposed approach consistently outperforms the conventional LSTM approach in terms of objective SE measures of speech intelligibility and quality. Furthermore, when evaluated on real data provided by the CHiME-4 Challenge for automatic speech recognition (ASR) of noisy microphone array speech, we show that the proposed approach with intermediate outputs can directly improve ASR performance, whereas the conventional LSTM approach increases the word error rate.
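
The decomposition described in the abstract can be sketched as follows. This is a minimal waveform-level illustration, not the authors' implementation: the paper operates on speech features, and the function names (`mix_at_snr`, `progressive_targets`, `multi_target_loss`), the -5 dB input SNR, the 10 dB step, and the weighted-MSE loss are all assumptions chosen for illustration. It shows how a large SNR gap might be split into intermediate targets, each a small SNR gain above the previous one, with clean speech as the final target.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR (dB)."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # The gain g satisfies clean_power / (g**2 * noise_power) = 10**(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def progressive_targets(clean, noise, input_snr_db, snr_gain_db, num_intermediate):
    """Return the noisy input plus a list of targets with increasing SNR.

    The list holds `num_intermediate` intermediate targets, each
    `snr_gain_db` above the previous one, followed by the clean signal.
    """
    noisy = mix_at_snr(clean, noise, input_snr_db)
    targets = [
        mix_at_snr(clean, noise, input_snr_db + k * snr_gain_db)
        for k in range(1, num_intermediate + 1)
    ]
    targets.append(clean)  # final target: clean speech
    return noisy, targets


def multi_target_loss(predictions, targets, weights):
    """Weighted sum of per-target mean squared errors (an assumed loss form)."""
    return sum(
        w * np.mean((p - t) ** 2)
        for w, p, t in zip(weights, predictions, targets)
    )


def measured_snr_db(clean, mixture):
    """SNR of `mixture` relative to `clean`, treating the residual as noise."""
    residual = mixture - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Synthetic stand-ins for speech and noise (hypothetical data).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)

noisy, targets = progressive_targets(
    clean, noise, input_snr_db=-5.0, snr_gain_db=10.0, num_intermediate=2
)
```

In a network of the kind the abstract describes, one output per target layer would be compared against the corresponding entry of `targets`, and the per-target errors combined as in `multi_target_loss`. The weights are a free design choice in this sketch.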

Learning Recurrent Neural Network Language Models with Context-Sensitive Label Smoothing for Automatic Speech Recognition

Data Noising as Smoothing in Neural Network Language Models

Learning label smoothing for text classification

Variance regularization of RNNLM for speech recognition

Discriminative method for recurrent neural network language models

MRSLN: A Multimodal Residual Speaker-LSTM Network to alleviate the over-smoothing issue for Emotion Recognition in Conversation

APAM: Adaptive Pre-training and Adaptive Meta Learning in Language Model for Noisy Labels and Long-tailed Learning

Global context-dependent recurrent neural network language model with sparse feature learning

Multiple-target Deep Learning for LSTM-RNN Based Speech Enhancement

Learning with Noisy Labels Via Sparse Regularization

Improving Time Series Classification with Representation Soft Label Smoothing

Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding

Focus on the Target's Vocabulary: Masked Label Smoothing for Machine Translation

Revisiting Over-Smoothness in Text to Speech

The Role of $n$-gram Smoothing in the Age of Neural Networks

Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition

Recurrent Neural Network Language Model with Part-of-speech for Mandarin Speech Recognition

A Multi-Target SNR-Progressive Learning Approach to Regression Based Speech Enhancement

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

Cross Entropy versus Label Smoothing: A Neural Collapse Perspective