Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noisy conditions. In particular, conventional joint training approaches for a pipeline comprising a speech enhancement (SE) model and an end-to-end ASR model suffer from conflicting objectives and a frame-mismatched alignment problem because ASR and SE have different goals and different frame structures. To mitigate these problems, a knowledge distillation (KD)-based training approach is proposed that interprets the ASR and SE models in the pipeline as the teacher and student models, respectively. In the proposed approach, the ASR model is first trained on a training dataset, and acoustic tokens are then generated by applying K-means clustering to the latent vectors of the ASR encoder. Thereafter, the SE model is trained via KD using the generated acoustic tokens. The performance of the SE and ASR models is evaluated on two databases, noisy LibriSpeech and CHiME-4, which correspond to simulated and real-world noise conditions, respectively. The experimental results show that the proposed KD-based training approach yields a lower character error rate (CER) and word error rate (WER) on both datasets than conventional joint training approaches, including multi-condition training. The results also show that the speech quality scores of the SE model trained using the proposed approach are higher than those of SE models trained using conventional training approaches. Moreover, the noise reduction scores of the proposed approach are higher than those of conventional joint training approaches but slightly lower than those of the standalone-SE training approach. Finally, an ablation study is conducted to examine how different combinations of loss functions in the proposed training approach contribute to SE and ASR performance. The results show that the combination of all loss functions yields the lowest CER and WER and that the tokenizer loss contributes more to SE and ASR performance improvement than the ASR encoder loss.
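The token-generation step described in the abstract (K-means clustering of ASR encoder latent vectors, with cluster indices serving as acoustic tokens) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D "latent vectors", the number of clusters, and the deterministic farthest-first initialization are all assumptions chosen for illustration, and the abstract does not specify the encoder, the cluster count, or the form of the KD losses.

```python
# Toy sketch: K-means over (hypothetical) ASR-encoder latent vectors.
# Each frame's acoustic token is the index of its nearest centroid.
# All names and data below are illustrative assumptions, not from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(vector, centroids[k]))


def init_centroids(vectors, num_tokens):
    """Deterministic farthest-first initialization (an assumption; any
    standard K-means initialization could be used instead)."""
    centroids = [list(vectors[0])]
    while len(centroids) < num_tokens:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def fit_kmeans(vectors, num_tokens, num_iters=20):
    """Plain Lloyd iterations: assign frames, then recompute centroids."""
    centroids = init_centroids(vectors, num_tokens)
    for _ in range(num_iters):
        assignments = [nearest_centroid(v, centroids) for v in vectors]
        for k in range(num_tokens):
            members = [v for v, a in zip(vectors, assignments) if a == k]
            if members:  # keep the previous centroid if a cluster is empty
                dim = len(members[0])
                centroids[k] = [
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)
                ]
    return centroids


def tokenize(vectors, centroids):
    """Map each latent frame to its acoustic token (cluster index)."""
    return [nearest_centroid(v, centroids) for v in vectors]


# Stand-in for latent vectors from a trained ASR encoder (one per frame).
latents = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]

centroids = fit_kmeans(latents, num_tokens=2)
tokens = tokenize(latents, centroids)
print(tokens)
```

In a KD setup of the kind the abstract outlines, token sequences like `tokens` would act as teacher-derived targets for the SE (student) model; how the tokenizer loss and ASR encoder loss are actually defined is not stated in the abstract, so they are left out of this sketch.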

Does Single-channel Speech Enhancement Improve Keyword Spotting Accuracy? A Case Study

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Enhancing Anti-spoofing Countermeasures Robustness through Joint Optimization and Transfer Learning

Noise Estimation Using Mean Square Cross Prediction Error for Speech Enhancement

AB/BA analysis: A framework for estimating keyword spotting recall improvement while maintaining audio privacy

Speech Enhancement for Wake-Up-Word detection in Voice Assistants

DCCRN-KWS: an audio bias based model for noise robust small-footprint keyword spotting

Keyword-Guided Adaptation of Automatic Speech Recognition

Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance

How does end-to-end speech recognition training impact speech enhancement artifacts?

Keyword Spotting Based on Syllable Confusion Network

Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance

Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Improving Speech Enhancement Using Audio Tagging Knowledge from Pre-Trained Representations and Multi-Task Learning

Improving Design of Input Condition Invariant Speech Enhancement

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments