Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noise conditions. In particular, conventional joint training approaches for a pipeline comprising a speech enhancement (SE) model and an end-to-end ASR model suffer from conflicting objectives and from a frame-mismatched alignment problem, because ASR and SE have different goals and different frame structures. To mitigate these problems, a knowledge distillation (KD)-based training approach is proposed in which the ASR and SE models in the pipeline are interpreted as the teacher and student models, respectively. In the proposed approach, the ASR model is first trained on a training dataset, and acoustic tokens are then generated by applying K-means clustering to the latent vectors of the ASR encoder. Thereafter, KD-based training of the SE model is performed using the generated acoustic tokens. The performance of the SE and ASR models is evaluated on two databases, noisy LibriSpeech and CHiME-4, which correspond to simulated and real-world noise conditions, respectively. The experimental results show that the proposed KD-based training approach yields a lower character error rate (CER) and word error rate (WER) on both datasets than conventional joint training approaches, including multi-condition training. The results also show that the speech quality scores of the SE model trained using the proposed approach are higher than those of SE models trained using conventional training approaches. Moreover, the noise reduction scores of the proposed approach are higher than those of conventional joint training approaches but slightly lower than those of the standalone-SE training approach. Finally, an ablation study examines how different combinations of loss functions in the proposed approach contribute to SE and ASR performance. The results show that combining all loss functions yields the lowest CER and WER, and that the tokenizer loss contributes more to the improvement in SE and ASR performance than the ASR encoder loss.
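The token-generation step described in the abstract can be illustrated with a minimal sketch. It assumes the teacher ASR encoder's latent vectors are available as a (frames, dim) array, and it uses random data as a stand-in for them. The function names, the codebook size of 8, and the cross-entropy form of the token loss are all hypothetical choices made for illustration; the abstract does not specify the actual tokenizer loss or how the frame mismatch between SE and ASR is handled.

```python
import numpy as np

def assign_tokens(latents, centroids):
    """Map each frame's latent vector to the index of its nearest centroid."""
    dists = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans_fit(latents, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns centroids of shape (k, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = latents[rng.choice(len(latents), size=k, replace=False)].copy()
    for _ in range(n_iter):
        tokens = assign_tokens(latents, centroids)
        for j in range(k):
            members = latents[tokens == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def token_cross_entropy(student_logits, teacher_tokens):
    """Mean cross-entropy between student logits (frames, k) and teacher token ids.

    Assumption: this is one plausible form of a token-level distillation loss,
    not necessarily the tokenizer loss used in the paper.
    """
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(teacher_tokens)), teacher_tokens]
    return float(-picked.mean())

# Stand-in for ASR encoder latent vectors: 200 frames, 16 dims (random data).
rng = np.random.default_rng(0)
teacher_latents = rng.normal(size=(200, 16))

# Teacher side: build the codebook and the per-frame acoustic tokens.
centroids = kmeans_fit(teacher_latents, k=8)
acoustic_tokens = assign_tokens(teacher_latents, centroids)

# Student side: hypothetical per-frame logits over the 8 tokens.
student_logits = rng.normal(size=(200, 8))
loss = token_cross_entropy(student_logits, acoustic_tokens)
```

In this sketch the tokens act as discrete frame-level targets derived from the teacher, so the student is pushed toward outputs that the ASR encoder would map to the same clusters. A real system would also need to align the SE and ASR frame rates, which is not shown here.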