Abstract: In this paper, we propose a novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise tracking capabilities of improved minima controlled recursive averaging (IMCRA) and deep learning of nonlinear interactions between speech and noise. First, a teacher model with deep architectures is built to learn the target of ideal ratio masks (IRMs) using simulated training pairs of clean and noisy speech data. Next, a student model is trained to learn an improved speech presence probability by incorporating the estimated IRMs from the teacher model into the IMCRA approach. Guided by a complex, noncausal teacher model, the student model can be compactly designed to run in a causal processing mode with no latency. Moreover, the clean speech requirement, which is difficult to meet in real-world adverse environments, can be relaxed for training the student model, implying that noisy speech data can be directly used to adapt the regression-based enhancement model to further improve speech recognition accuracy for noisy speech collected in such conditions. Experiments on the CHiME-4 challenge task show that our best student model with bidirectional gated recurrent units (BGRUs) can achieve a relative word error rate (WER) reduction of 18.85% for the real test set when compared to the unprocessed system without acoustic model retraining. However, the traditional teacher model degrades the performance of the unprocessed system in this case. In addition, the student model with a deep neural network (DNN) in causal mode with no latency yields a relative WER reduction of 7.94% over the unprocessed system with 670 times fewer computing cycles when compared to the BGRU-equipped student model. Finally, conventional speech enhancement and the IRM-based deep learning method degraded automatic speech recognition (ASR) performance when the recognition system became more powerful, whereas our proposed approach still improved ASR performance even with the more powerful recognition system.
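As a rough illustration of the teacher's learning target, the sketch below computes an ideal ratio mask for a single frame from clean and noise power spectra. The abstract does not state the exact IRM formula, so the commonly used form (clean power divided by clean-plus-noise power, raised to an exponent of 0.5) is an assumption here, and the function name `ideal_ratio_mask`, the `beta` and `eps` parameters, and the toy spectra are hypothetical, chosen only for illustration.

```python
import math


def ideal_ratio_mask(clean_power, noise_power, beta=0.5, eps=1e-12):
    """Return a per-bin ideal ratio mask for one frame.

    Assumed definition (not taken from the paper):
        IRM = (S / (S + N)) ** beta
    where S and N are the clean and noise power in a frequency bin.
    """
    if len(clean_power) != len(noise_power):
        raise ValueError("clean and noise spectra must have the same number of bins")
    mask = []
    for s, n in zip(clean_power, noise_power):
        # eps guards against division by zero in silent bins.
        ratio = s / (s + n + eps)
        mask.append(ratio ** beta)
    return mask


# Toy frame with three frequency bins (made-up values, for illustration only).
frame_clean = [4.0, 1.0, 0.0]   # speech-dominated, mixed, silent
frame_noise = [0.0, 1.0, 2.0]   # noise-free, mixed, noise-dominated
irm = ideal_ratio_mask(frame_clean, frame_noise)
```

Mask values lie between 0 and 1: close to 1 where speech dominates a bin and close to 0 where noise dominates. Computing this target requires separate access to the clean and noise signals, which is why the teacher is trained on simulated pairs of clean and noisy speech.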

Speech Enhancement Based on Teacher–Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition
