Abstract: In this paper, we propose a novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise tracking capabilities of improved minima controlled recursive averaging (IMCRA) and deep learning of nonlinear interactions between speech and noise. First, a teacher model with deep architectures is built to learn the target of ideal ratio masks (IRMs) using simulated training pairs of clean and noisy speech data. Next, a student model is trained to learn an improved speech presence probability by incorporating the estimated IRMs from the teacher model into the IMCRA approach. Guided by a complex, noncausal teacher model, the student model can be compactly designed to run in a causal processing mode with no latency. Moreover, the clean speech requirement, which is difficult to meet in real-world adverse environments, can be relaxed for training the student model, implying that noisy speech data can be used directly to adapt the regression-based enhancement model and further improve speech recognition accuracy for noisy speech collected in such conditions. Experiments on the CHiME-4 challenge task show that our best student model with bidirectional gated recurrent units (BGRUs) achieves a relative word error rate (WER) reduction of 18.85% on the real test set compared to the unprocessed system without acoustic model retraining, whereas the traditional teacher model degrades the performance of the unprocessed system in this case. In addition, the student model with a deep neural network (DNN) in causal mode with no latency yields a relative WER reduction of 7.94% over the unprocessed system while using 670 times fewer computing cycles than the BGRU-equipped student model. Finally, the conventional speech enhancement and IRM-based deep learning methods degraded ASR performance when the recognition system became more powerful, while our proposed approach could still improve ASR performance even with the more powerful recognition system.
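As a rough illustration of two quantities the abstract relies on, the sketch below computes an ideal ratio mask and a relative WER reduction. It assumes the common power-ratio form of the IRM, since the abstract does not state the exact definition used for the teacher target. The function names, the `eps` guard, and all input values are hypothetical and chosen only for illustration.

```python
def ideal_ratio_mask(speech_power, noise_power, eps=1e-12):
    """Per-bin ideal ratio mask for one frame.

    Assumption: IRM = S / (S + N), where S and N are the clean-speech and
    noise power in each time-frequency bin. Some work applies a square
    root to this ratio; the abstract does not say which form is used.
    """
    return [s / (s + n + eps) for s, n in zip(speech_power, noise_power)]


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical power values for three frequency bins of one frame.
mask = ideal_ratio_mask([1.0, 0.0, 3.0], [1.0, 2.0, 1.0])

# Hypothetical WERs (in percent), not results from the paper.
reduction = relative_wer_reduction(20.0, 16.0)
```

A mask value near 1 marks a bin dominated by speech and a value near 0 marks a bin dominated by noise, which is why clean and noisy training pairs are needed to compute this target for the teacher model.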

CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments

Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

Audio-Visual Efficient Conformer for Robust Speech Recognition

Multimodal Speech Recognition Using EEG and Audio Signals: A Novel Approach for Enhancing ASR Systems

Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training

Robust Speaker Recognition with Transformers Using wav2vec 2.0

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models

Residual Convolutional CTC Networks for Automatic Speech Recognition

Noise Robust Speech Recognition on Aurora4 by Humans and Machines

Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Speech Enhancement Based on Teacher–Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition

Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages

Improving child speech recognition with augmented child-like speech

Wavoice: A Noise-resistant Multi-modal Speech Recognition System Fusing mmWave and Audio Signals

Advancing Multi-Accented LSTM-CTC Speech Recognition using a Domain Specific Student-Teacher Learning Paradigm

RoboVox Far Field Speaker Recognition: A Novel Data Augmentation Approach with Pretrained Models