Abstract: The self-attention mechanism has recently marked a new milestone in automatic speech recognition (ASR). Its performance, however, is sensitive to environmental interference, because the system predicts each output symbol from the full input sequence and the previous predictions. A popular remedy is to add an independent speech enhancement module as a front-end. Because that front-end is trained separately from the ASR module, it easily settles on a solution that is sub-optimal for recognition. In addition, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which can even degrade ASR performance. Inspired by the extensive use of generative adversarial networks (GANs) in speech enhancement and ASR, we propose an adversarial joint training framework with the self-attention mechanism to improve the noise robustness of the ASR system. It consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. The framework has two notable advantages. First, it benefits from advances in both the self-attention mechanism and GANs. Second, during adversarial joint training the GAN discriminator acts as a global discriminant network that guides the enhancement front-end to capture structures more compatible with the subsequent ASR module, thereby offsetting the limitations of separate training and handcrafted loss functions. With adversarial joint optimization, the proposed framework is expected to learn more robust representations suited to the ASR task.
We conduct systematic experiments on the AISHELL-1 corpus. On the artificially noisy test set, the proposed framework achieves relative improvements of 66% over an ASR model trained solely on clean data, 35.1% over the speech enhancement plus ASR scheme without joint training, and 5.3% over multi-condition training.
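
The relative improvements quoted above are presumably relative reductions in a recognition error rate; the abstract does not state the metric or the absolute error rates, so the following is only a minimal sketch of how such a figure is usually computed. The function name `relative_improvement` and the numbers in the example are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_error: float, proposed_error: float) -> float:
    """Relative error-rate reduction, in percent, of a proposed system over a baseline.

    Assumption: "relative improvement" means (baseline - proposed) / baseline,
    applied to an error rate where lower is better.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Hypothetical error rates, chosen only to illustrate the arithmetic;
# they are not taken from the paper.
example = relative_improvement(baseline_error=20.0, proposed_error=15.0)
print(f"{example:.1f}%")
```

Under this definition, a relative improvement says nothing about the absolute error rates by itself, which is why the three percentages in the abstract are each tied to a named baseline.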

Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition
