Abstract: In this paper, we propose a novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise tracking capabilities of improved minima controlled recursive averaging (IMCRA) and deep learning of nonlinear interactions between speech and noise. First, a teacher model with deep architectures is built to learn the target of ideal ratio masks (IRMs) using simulated training pairs of clean and noisy speech data. Next, a student model is trained to learn an improved speech presence probability by incorporating the estimated IRMs from the teacher model into the IMCRA approach. Guided by a complex, noncausal teacher model, the student model can be compactly designed to run in a causal processing mode with no latency. Moreover, the clean speech requirement, which is difficult to meet in real-world adverse environments, can be relaxed for training the student model. Noisy speech data can therefore be used directly to adapt the regression-based enhancement model, further improving speech recognition accuracy for noisy speech collected in such conditions. Experiments on the CHiME-4 challenge task show that our best student model with bidirectional gated recurrent units (BGRUs) achieves a relative word error rate (WER) reduction of 18.85% on the real test set compared to the unprocessed system without acoustic model retraining. In contrast, the traditional teacher model degrades the performance of the unprocessed system in this case. In addition, the student model with a deep neural network (DNN) in causal mode with no latency yields a relative WER reduction of 7.94% over the unprocessed system, with 670 times fewer computing cycles than the BGRU-equipped student model. Finally, conventional speech enhancement and the IRM-based deep learning method degraded ASR performance when the recognition system became more powerful, whereas our proposed approach still improved ASR performance even with the more powerful recognition system.
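
As a rough illustration of the ideal ratio mask that the teacher model is trained to predict, the following sketch computes a per-bin mask from clean-speech and noise power spectra. The abstract does not define the mask, so the formula used here (speech power divided by the sum of speech and noise power) is an assumed, commonly used definition; the paper's exact form may differ, for example by taking a square root. The function name `ideal_ratio_mask`, the `eps` parameter, and the toy values are hypothetical.

```python
import numpy as np

def ideal_ratio_mask(clean_power, noise_power, eps=1e-12):
    """Per-bin ratio mask from clean-speech and noise power spectra.

    Assumed definition: S / (S + N), which lies in [0, 1].
    eps only guards against division by zero in silent bins.
    """
    clean_power = np.asarray(clean_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return clean_power / (clean_power + noise_power + eps)

# Toy power spectra (frames x frequency bins); the values are made up.
clean = np.array([[4.0, 0.0], [1.0, 9.0]])
noise = np.array([[4.0, 1.0], [3.0, 1.0]])
irm = ideal_ratio_mask(clean, noise)
```

Bins dominated by speech get a mask value near 1 and bins dominated by noise get a value near 0. Computing this target requires the clean and noise components separately, which is why the teacher model depends on simulated training pairs.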

Speech Enhancement Based on Teacher–Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition