Abstract: This paper addresses training issues in neural network-based automatic speech recognition (ASR) under noisy conditions. In particular, conventional joint training approaches for a pipeline comprising a speech enhancement (SE) model and an end-to-end ASR model suffer from conflicting training objectives and a frame-alignment mismatch, because ASR and SE have different goals and different frame structures. To mitigate these problems, a knowledge distillation (KD)-based training approach is proposed in which the ASR and SE models in the pipeline are interpreted as the teacher and student models, respectively. In the proposed approach, the ASR model is first trained on a training dataset, and acoustic tokens are then generated by K-means clustering of the latent vectors of the ASR encoder. Thereafter, the SE model is trained by KD using the generated acoustic tokens. The performance of the SE and ASR models is evaluated on two databases, noisy LibriSpeech and CHiME-4, which correspond to simulated and real-world noise conditions, respectively. The experimental results show that the proposed KD-based training approach yields a lower character error rate (CER) and word error rate (WER) on both datasets than conventional joint training approaches, including multi-condition training. The results also show that the speech quality scores of the SE model trained with the proposed approach are higher than those of SE models trained with conventional approaches. Moreover, the noise reduction scores of the proposed approach are higher than those of conventional joint training approaches but slightly lower than those of the standalone SE training approach. Finally, an ablation study examines how different combinations of loss functions in the proposed approach contribute to SE and ASR performance. The results show that combining all loss functions yields the lowest CER and WER, and that the tokenizer loss contributes more to SE and ASR performance improvement than the ASR encoder loss.
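
The acoustic-token step described in the abstract can be illustrated with a minimal sketch. It assumes the ASR encoder outputs are already available as a frame-by-dimension array; random data stands in for them here. The function name `kmeans_tokens`, the number of tokens, and the iteration count are hypothetical choices for illustration, not values taken from the paper, and the plain Lloyd's-algorithm loop is a generic K-means, not necessarily the authors' implementation.

```python
import numpy as np

def kmeans_tokens(latents, num_tokens, num_iters=20, seed=0):
    """Cluster frame-level latent vectors; return (centroids, token ids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct frames.
    init_idx = rng.choice(len(latents), size=num_tokens, replace=False)
    centroids = latents[init_idx].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        tokens = dists.argmin(axis=1)
        for k in range(num_tokens):
            members = latents[tokens == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)

# Hypothetical stand-in for ASR-encoder outputs: 200 frames of 16-dim vectors.
rng = np.random.default_rng(1)
latents = rng.normal(size=(200, 16))
centroids, tokens = kmeans_tokens(latents, num_tokens=8)
```

Each frame is mapped to the index of its nearest centroid, which plays the role of an acoustic token. In the approach the abstract describes, such tokens derived from the teacher ASR encoder serve as targets when training the student SE model; the exact form of the tokenizer loss is not given in the abstract and is not sketched here.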

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition
