Abstract: Audio data augmentation is widely adopted in automatic speech recognition (ASR) to alleviate overfitting. However, noise-based data augmentation converts an overfitting problem into an underfitting problem, which severely increases training time. With noise-based data augmentation, informative features are not preserved during generation, and the generated audio clips can become noise data for the acoustic model. To address this challenge, we propose an Adaptive audio Data Augmentation method with deep clustering, called ADA. ADA automatically selects the most informative augmented sample for each generation. Moreover, we propose two sample selection strategies, RM and RS. RM removes samples whose embeddings are far from the cluster center, while RS maintains the diversity of augmented samples by sampling within each cluster. Experiments on Aishell-1 demonstrate that ADA improves the data efficiency of end-to-end ASR models with both CNN-based and Transformer-based networks. Compared with the state-of-the-art audio data augmentation method, ADA obtains relative improvements of 11.28% and 5.95% on SS-CNN and LS-CNN, respectively, and a 4.35% improvement on S-Transformer. Meanwhile, ADA reduces the number of augmented samples required by a factor of 2.7 for SS-CNN, LS-CNN, and S-Transformer. Qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed ADA method.
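The abstract describes the RM and RS selection strategies only at a high level. Below is a minimal Python sketch, under stated assumptions, of how cluster-based selection over embeddings of augmented clips might look. Plain k-means with a deterministic farthest-first initialization stands in for the deep clustering step, which is not reproduced here. RM is assumed to be a fixed distance threshold to the assigned centroid. RS is assumed to draw a fixed number of samples from every cluster. The names `kmeans`, `select_rm`, `select_rs`, `max_dist` and `per_cluster` are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _dist(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _nearest(p: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to embedding p."""
    return min(range(len(centroids)), key=lambda c: _dist(p, centroids[c]))


def kmeans(points: List[Vector], k: int, iters: int = 50) -> Tuple[List[List[float]], List[int]]:
    """Lloyd's k-means (a stand-in for deep clustering); returns (centroids, labels)."""
    # Deterministic farthest-first initialization: start at points[0], then
    # repeatedly add the point farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    for _ in range(iters):
        labels = [_nearest(p, centroids) for p in points]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Move the centroid to its members' mean; keep it if the cluster is empty.
            new.append([sum(d) / len(members) for d in zip(*members)] if members else centroids[c])
        if new == centroids:  # assignments stopped changing
            break
        centroids = new
    labels = [_nearest(p, centroids) for p in points]
    return centroids, labels


def select_rm(points: List[Vector], labels: List[int], centroids: List[List[float]],
              max_dist: float) -> List[int]:
    """Assumed RM rule: drop samples farther than max_dist from their cluster center."""
    return [i for i, (p, lab) in enumerate(zip(points, labels))
            if _dist(p, centroids[lab]) <= max_dist]


def select_rs(labels: List[int], per_cluster: int, seed: int = 0) -> List[int]:
    """Assumed RS rule: sample up to per_cluster indices from every cluster for diversity."""
    rng = random.Random(seed)
    chosen: List[int] = []
    for c in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == c]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(chosen)


if __name__ == "__main__":
    # Toy 2-D "embeddings": two tight groups, plus one clip (index 6) far from its group.
    emb = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [6.5, 6.5]]
    cents, labs = kmeans(emb, k=2)
    print("RM keeps:", select_rm(emb, labs, cents, max_dist=1.0))  # -> [0, 1, 2, 3, 4, 5]
    print("RS keeps:", select_rs(labs, per_cluster=2))
```

In this toy run, RM discards only the clip that sits far from its cluster center. RS keeps two clips from each cluster, so both regions of the embedding space stay represented. A real system would replace the fixed `max_dist` and the toy embeddings with quantities learned from the acoustic model.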

SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition

Data Augmentation for End-to-end Code-switching Speech Recognition

A Policy-based Approach to the SpecAugment Method for Low Resource E2E ASR

On Using SpecAugment for End-to-End Speech Translation

Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition

Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

Towards Automatic Data Augmentation for Disordered Speech Recognition

Adaptive data augmentation for Mandarin automatic speech recognition

SpecMix: A Mixed Sample Data Augmentation method for Training with Time-Frequency Domain Features

Pre-training End-to-end ASR Models with Augmented Speech Samples Queried by Text

A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit

SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation

Exploring data augmentation in bias mitigation against non-native-accented speech

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

A Light CNN with Split Batch Normalization for Spoofed Speech Detection Using Data Augmentation

Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification

Speech Recognition with Augmented Synthesized Speech

Data Augmentation for Arabic Speech Recognition Based on End-to-End Deep Learning

Data augmentation using prosody and false starts to recognize non-native children's speech

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition