Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noise conditions. In particular, conventional joint training approaches for a pipeline comprising a speech enhancement (SE) model and an end-to-end ASR model suffer from conflicting objectives and mismatched frame alignment, because ASR and SE have different goals and different frame structures. To mitigate these problems, a knowledge distillation (KD)-based training approach is proposed that treats the ASR and SE models in the pipeline as the teacher and student models, respectively. In the proposed approach, the ASR model is first trained on a training dataset, and acoustic tokens are then generated by K-means clustering of the latent vectors of the ASR encoder. Thereafter, the SE model is trained with KD using the generated acoustic tokens. The performance of the SE and ASR models is evaluated on two databases, noisy LibriSpeech and CHiME-4, which correspond to simulated and real-world noise conditions, respectively. The experimental results show that the proposed KD-based training approach yields a lower character error rate (CER) and word error rate (WER) on both datasets than conventional joint training approaches, including multi-condition training. The results also show that the speech quality scores of the SE model trained using the proposed approach are higher than those of SE models trained using conventional training approaches. Moreover, the noise reduction scores of the proposed approach are higher than those of conventional joint training approaches but slightly lower than those of the standalone-SE training approach. Finally, an ablation study examines how different combinations of loss functions in the proposed approach contribute to SE and ASR performance. The results show that the combination of all loss functions yields the lowest CER and WER and that the tokenizer loss contributes more to SE and ASR performance improvement than the ASR encoder loss.
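The acoustic-token step described in the abstract (K-means clustering of ASR encoder latent vectors) can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: the function names `fit_kmeans` and `assign_tokens`, the codebook size, the latent dimension, and the use of random vectors in place of real ASR encoder outputs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def assign_tokens(latents, centroids):
    """Map each frame-level latent vector to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(latents, num_tokens, num_iters=20, seed=0):
    """Plain Lloyd's K-means; returns a (num_tokens, dim) codebook."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct frames.
    init_idx = rng.choice(len(latents), size=num_tokens, replace=False)
    centroids = latents[init_idx].copy()
    for _ in range(num_iters):
        labels = assign_tokens(latents, centroids)
        for k in range(num_tokens):
            members = latents[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


# Hypothetical stand-in for ASR encoder outputs: 500 frames of 16-dim latents.
# In the paper's setting these would come from the trained ASR encoder.
rng = np.random.default_rng(1)
latents = rng.normal(size=(500, 16))

codebook = fit_kmeans(latents, num_tokens=8)  # assumed codebook size
tokens = assign_tokens(latents, codebook)     # one acoustic token per frame
```

In a KD setup like the one the abstract outlines, frame-level token indices of this kind would serve as targets from the teacher (ASR) side for training the student (SE) model; the exact loss formulation is not given in the abstract and is not assumed here.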

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition
