Abstract: The self-attention mechanism has recently marked a new milestone in automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental interference, because the system predicts each output symbol from the full input sequence and the previous predictions. A popular remedy is to add an independent speech enhancement module as a front-end. However, because it is trained separately from the ASR module, such a front-end easily settles on a sub-optimal solution. In addition, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which can even degrade ASR performance. Inspired by the extensive use of generative adversarial networks (GANs) in speech enhancement and ASR, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. It consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. The framework has two notable advantages. First, it benefits from advances in both the self-attention mechanism and GANs. Second, the GAN discriminator acts as a global discriminant network during adversarial joint training, guiding the enhancement front-end to capture structures that are more compatible with the subsequent ASR module and thereby offsetting the limitations of separate training and handcrafted loss functions. With adversarial joint optimization, the proposed framework is expected to learn more robust representations suited to the ASR task.
We conduct systematic experiments on the AISHELL-1 corpus. On the artificial noisy test set, the proposed framework achieves relative improvements of 66% over the ASR model trained solely on clean data, 35.1% over the speech enhancement and ASR scheme without joint training, and 5.3% over multi-condition training.
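
The relative improvements quoted above can be read as relative error reductions. The following is a minimal sketch of that arithmetic, assuming the underlying metric is an error rate where lower is better (the abstract does not name the metric). The function name `relative_improvement` and the error rates in the example are hypothetical and chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_improvement(baseline_error: float, proposed_error: float) -> float:
    """Return the relative error reduction, in percent, of a proposed
    system over a baseline. Assumes lower error rates are better."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Hypothetical error rates (NOT taken from the paper), for illustration only.
example = relative_improvement(baseline_error=50.0, proposed_error=25.0)
print(f"{example:.1f}%")  # prints "50.0%"
```

Under this reading, a relative improvement states how much of the baseline's error is removed, so it cannot be converted back to absolute error rates without knowing the baseline.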

Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition

Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training

Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition

Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition

Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

Multi-Metric Optimization using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement

SEGAN: Speech Enhancement Generative Adversarial Network

CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement

MAMGAN: Multiscale attention metric GAN for monaural speech enhancement in the time domain

AeGAN: Time-Frequency Speech Denoising via Generative Adversarial Networks

Single-Channel Speech Quality Enhancement in Mobile Networks Based on Generative Adversarial Networks

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

CGA-MGAN: Metric GAN Based on Convolution-Augmented Gated Attention for Speech Enhancement

Single Channel Far Field Feature Enhancement For Speaker Verification In The Wild

Single-Channel Speech Enhancement Algorithm Based on ME-MGCRN in Low Signal-to-Noise Scenario

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Unpaired Speech Enhancement by Acoustic and Adversarial Supervision for Speech Recognition

Towards Generalized Speech Enhancement with Generative Adversarial Networks