Abstract: End-to-end speech recognition, including attention-based approaches, has attracted growing interest in recent years and has achieved performance comparable to that of the traditional speech recognition framework. Because end-to-end approaches integrate acoustic and linguistic information into one model, perturbations at the acoustic level, such as acoustic noise, can easily propagate to the linguistic level. Improving the robustness of these end-to-end systems in real application environments is therefore crucial. In this paper, to make the attention-based end-to-end model more robust against noise, we regularize the objective function with adversarial training examples. In particular, two adversarial regularization techniques, the fast gradient-sign method and the local distributional smoothness method, are explored to improve noise robustness. Experiments on two publicly available Mandarin Chinese corpora, AISHELL-1 and AISHELL-2, show that adversarial regularization is an effective approach to improving the noise robustness of our attention-based models. Specifically, we obtained an 18.4% relative character error rate (CER) reduction on the AISHELL-1 noisy test set. Even on the clean test set, we observed a 16.7% relative improvement. As the training set grows and covers more environmental variety, our proposed methods remain effective, although the improvement shrinks. Training on the large AISHELL-2 training corpus and testing on the various AISHELL-2 test sets, we achieved 7.0%-12.2% relative error rate reductions. To our knowledge, this is the first successful application of adversarial regularization to sequence-to-sequence speech recognition systems.
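The fast gradient-sign idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: a linear softmax classifier stands in for the attention-based model, and the function names, the perturbation size `epsilon`, and the mixing weight `alpha` are all assumptions chosen for illustration. The local distributional smoothness method is not shown.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(W, x, y):
    """Mean cross-entropy of a linear softmax classifier (toy stand-in for an ASR model)."""
    logp = log_softmax(x @ W)
    return -np.mean(logp[np.arange(len(y)), y])

def input_gradient(W, x, y):
    """Analytic gradient of the mean cross-entropy with respect to the inputs x."""
    p = np.exp(log_softmax(x @ W))
    p[np.arange(len(y)), y] -= 1.0      # dL/dlogits = softmax - one_hot
    return (p @ W.T) / len(y)

def fgsm_perturb(W, x, y, epsilon=0.1):
    """Fast gradient-sign step: move each input feature by epsilon along the loss gradient sign."""
    return x + epsilon * np.sign(input_gradient(W, x, y))

def adversarial_objective(W, x, y, epsilon=0.1, alpha=0.5):
    """Clean loss mixed with the loss on adversarial examples (alpha is a hypothetical weight)."""
    x_adv = fgsm_perturb(W, x, y, epsilon)
    return alpha * cross_entropy(W, x, y) + (1.0 - alpha) * cross_entropy(W, x_adv, y)

# Toy data: 16 "frames" of 8-dimensional features, 4 output classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
x = rng.normal(size=(16, 8))
y = rng.integers(0, 4, size=16)

clean_loss = cross_entropy(W, x, y)
adv_loss = cross_entropy(W, fgsm_perturb(W, x, y), y)
mixed_loss = adversarial_objective(W, x, y)
```

For this toy model the loss is convex in the input, so the adversarial loss is never lower than the clean loss. Training against the mixed objective therefore penalizes sensitivity to small, worst-case input perturbations.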

Improved Regularization Techniques for End-to-End Speech Recognition

Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition

Adversarial Regularization for Attention Based End-to-End Robust Speech Recognition

Nonlinear Regularization Decoding Method for Speech Recognition

SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

Unsupervised Regularization-Based Adaptive Training for Speech Recognition

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

On the effectiveness of enrollment speech augmentation for Target Speaker Extraction

Unsupervised Adaptation with Adversarial Dropout Regularization for Robust Speech Recognition

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Adversarial Regularization for End-to-end Robust Speaker Verification

Improving speech recognition using data augmentation and acoustic model fusion

Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

Improved Noisy Student Training for Automatic Speech Recognition

Improving Speech Recognition with Augmented Synthesized Data and Conditional Model Training

Gradient Regularization for Noise-Robust Speaker Verification

Improving Unsupervised Style Transfer in End-to-End Speech Synthesis with End-to-End Speech Recognition

Performance Improvement on Traditional Chinese Task-Oriented Dialogue Systems With Reinforcement Learning and Regularized Dropout Technique