Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noise conditions. In particular, conventional joint training approaches for a pipeline comprising a speech enhancement (SE) model and an end-to-end ASR model suffer from a conflict between training objectives and from a frame-mismatched alignment problem, because ASR and SE have different goals and different frame structures. To mitigate these problems, a knowledge distillation (KD)-based training approach is proposed that interprets the ASR and SE models in the pipeline as the teacher and student models, respectively. In the proposed approach, the ASR model is first trained on a training dataset, and acoustic tokens are then generated by K-means clustering of the latent vectors of the ASR encoder. Thereafter, KD-based training of the SE model is performed using the generated acoustic tokens. The performance of the SE and ASR models is evaluated on two databases, noisy LibriSpeech and CHiME-4, which correspond to simulated and real-world noise conditions, respectively. The experimental results show that the proposed KD-based training approach yields a lower character error rate (CER) and word error rate (WER) on both datasets than conventional joint training approaches, including multi-condition training. The results also show that the speech quality scores of the SE model trained using the proposed approach are higher than those of SE models trained using conventional training approaches. Moreover, the noise reduction scores of the proposed approach are higher than those of conventional joint training approaches but slightly lower than those of the standalone-SE training approach. Finally, an ablation study examines how different combinations of loss functions in the proposed approach contribute to SE and ASR performance. The results show that the combination of all loss functions yields the lowest CER and WER and that the tokenizer loss contributes more to SE and ASR performance improvement than the ASR encoder loss.
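
The acoustic-token step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the latent vectors are random stand-ins for ASR encoder outputs, the sizes (200 frames, 8 dimensions, 4 tokens) are arbitrary, and the names `fit_kmeans`, `assign_tokens`, and `token_cross_entropy` are hypothetical. The cross-entropy form of the tokenizer loss is likewise an assumption, since the abstract does not state the loss definition.

```python
import numpy as np

def assign_tokens(latents, centroids):
    """Map each frame to the index of its nearest centroid (its acoustic token)."""
    dists = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fit_kmeans(latents, num_tokens, num_iters=20, seed=0):
    """Plain Lloyd's K-means; returns centroids of shape (num_tokens, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames (no repeats).
    picks = rng.choice(len(latents), size=num_tokens, replace=False)
    centroids = latents[picks].copy()
    for _ in range(num_iters):
        tokens = assign_tokens(latents, centroids)
        for k in range(num_tokens):
            members = latents[tokens == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def token_cross_entropy(student_logits, teacher_tokens):
    """Assumed tokenizer loss: mean cross-entropy of student logits vs. teacher tokens."""
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(teacher_tokens)), teacher_tokens].mean()

# Stand-in for teacher (ASR encoder) latents: random values, illustrative only.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 8))
centroids = fit_kmeans(latents, num_tokens=4)
tokens = assign_tokens(latents, centroids)  # one token id per frame
```

In a setup like this, the token ids would serve as frame-level targets for the student: the SE model's output would be passed through a token predictor, and `token_cross_entropy` would penalise disagreement with the teacher's tokens. How the paper handles the differing frame structures of the SE and ASR models is not specified in the abstract and is not modelled here.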

Transfer Learning for Acoustic Modeling of Noise Robust Speech Recognition

Speech Enhancement Based on Teacher–Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition

Noisy training for deep neural networks in speech recognition

Improving Audio-visual Speech Recognition Performance with Cross-modal Student-teacher Training

Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition

An efficient joint training model for monaural noisy-reverberant speech recognition

Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

Autoregressive Model-Based Robust Speech Recognition in Additive Noise Environment

Listening to the World Improves Speech Command Recognition

Transfer Learning from Whisper for Microscopic Intelligibility Prediction

Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Training Multi-Task Adversarial Network for Extracting Noise-Robust Speaker Embedding

Noise-robust voice conversion using adversarial training with multi-feature decoupling

VTS-based Robust Speech Recognition

A noise-robust voice conversion method with controllable background sounds

Transfer Learning Based Progressive Neural Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis.

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition

Cross-language transfer learning for deep neural network based speech enhancement

Residual Noise Compensation For Robust Speech Recognition In Nonstationary Noise