Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noise conditions. In particular, conventional joint training approaches for a pipeline comprising a speech enhancement (SE) model and an end-to-end ASR model suffer from a conflicting problem and a frame-mismatched alignment problem because ASR and SE have different goals and different frame structures. To mitigate these problems, a knowledge distillation (KD)-based training approach is proposed that treats the ASR and SE models in the pipeline as the teacher and student models, respectively. In the proposed approach, the ASR model is first trained on a training dataset, and acoustic tokens are then generated by applying K-means clustering to the latent vectors of the ASR encoder. Thereafter, KD-based training of the SE model is performed using the generated acoustic tokens. The performance of the SE and ASR models is evaluated on two databases, noisy LibriSpeech and CHiME-4, which correspond to simulated and real-world noise conditions, respectively. The experimental results show that the proposed KD-based training approach yields a lower character error rate (CER) and word error rate (WER) on both datasets than conventional joint training approaches, including multi-condition training. The results also show that the speech quality scores of the SE model trained using the proposed approach are higher than those of SE models trained using conventional training approaches. Moreover, the noise reduction scores of the proposed approach are higher than those of conventional joint training approaches but slightly lower than those of the standalone-SE training approach. Finally, an ablation study is conducted to examine how different combinations of loss functions in the proposed approach contribute to SE and ASR performance. The results show that the combination of all loss functions yields the lowest CER and WER and that the tokenizer loss contributes more to the improvement in SE and ASR performance than the ASR encoder loss.
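The tokenization step described in the abstract (K-means clustering over ASR encoder latent vectors, followed by distillation on the resulting acoustic tokens) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`fit_kmeans`, `assign_tokens`, `token_distillation_loss`), the toy dimensions, the random vectors standing in for real encoder outputs, and the choice of a frame-level cross-entropy as the token-based loss. The abstract does not specify the exact form of the tokenizer loss.

```python
import numpy as np

def assign_tokens(latents, centroids):
    """Map each frame to the index of its nearest centroid (its acoustic token)."""
    # Squared Euclidean distances between frames and centroids, shape (N, K).
    dists = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fit_kmeans(latents, num_tokens, num_iters=20, seed=0):
    """Plain Lloyd's K-means over frame-level latent vectors of shape (N, D)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct frames.
    init = rng.choice(len(latents), size=num_tokens, replace=False)
    centroids = latents[init].copy()
    for _ in range(num_iters):
        tokens = assign_tokens(latents, centroids)
        for k in range(num_tokens):
            members = latents[tokens == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def token_distillation_loss(student_logits, teacher_tokens):
    """Assumed loss: cross-entropy between student logits (N, K) and token ids (N,)."""
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(teacher_tokens)), teacher_tokens].mean()

# Toy stand-in for teacher (ASR encoder) latent vectors: 200 frames, 8 dimensions.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 8))

centroids = fit_kmeans(latents, num_tokens=4)
tokens = assign_tokens(latents, centroids)   # teacher-side acoustic tokens

# Hypothetical student (SE-side) predictions over the same 4 tokens.
student_logits = np.zeros((200, 4))
loss = token_distillation_loss(student_logits, tokens)
```

In this sketch the teacher-side tokens are fixed targets, so the student is trained toward discrete labels rather than toward the ASR loss itself. This is one plausible reading of how token-based distillation could sidestep the conflict between the SE and ASR objectives, not a statement of the authors' implementation.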

Knowledge Distillation for End-to-End Monaural Multi-talker ASR System

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

End-to-end Monaural Multi-speaker ASR System Without Pretraining

Knowledge Distillation from Multilingual and Monolingual Teachers for End-to-End Multilingual Speech Recognition

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition

Mutual-learning Sequence-Level Knowledge Distillation for Automatic Speech Recognition

Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition

Improving End-to-End Speech Recognition Through Conditional Cross-Modal Knowledge Distillation with Language Model

Injecting Spatial Information for Monaural Speech Enhancement via Knowledge Distillation

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

Monaural Multi-Talker Speech Recognition With Attention Mechanism And Gated Convolutional Networks

Improving Multi-Speaker ASR With Overlap-Aware Encoding And Monotonic Attention

Real-time End-to-End Monaural Multi-speaker Speech Recognition

Improving Monaural Speech Enhancement by Mapping to Fixed Simulation Space With Knowledge Distillation

Essence Knowledge Distillation for Speech Recognition

Sub-band Knowledge Distillation Framework for Speech Enhancement

Distilling Knowledge Using Parallel Data for Far-field Speech Recognition

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition

MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition