Adaptive data augmentation for Mandarin automatic speech recognition
Ding, Kai; Xu, Yuelin; Du, Xingyue; Deng, Bin
DOI: https://doi.org/10.1007/s10489-024-05381-6
IF: 5.3
2024-04-25
Applied Intelligence
Abstract: Audio data augmentation is widely adopted in automatic speech recognition (ASR) to alleviate overfitting. However, noise-based data augmentation converts the overfitting problem into an underfitting problem, which severely increases training time. With noise-based data augmentation, informative features are not preserved during generation, and the generated audio clips become noise data for the acoustic model. To address this challenge, we propose an Adaptive audio Data Augmentation method with deep clustering, called ADA. ADA automatically selects the most informative augmented samples in each generation. Moreover, two sample selection strategies, called RM and RS, are proposed. RM removes samples whose embeddings are far from the cluster center, while RS maintains the diversity of augmented samples by sampling within each cluster. Experiments on Aishell-1 demonstrate that ADA improves the data efficiency of end-to-end ASR models in both CNN-based and Transformer-based networks. ADA obtains 11.28% and 5.95% relative improvements on SS-CNN and LS-CNN, respectively, and a 4.35% improvement on S-Transformer compared with the state-of-the-art audio data augmentation method. Meanwhile, ADA reduces the number of augmented samples required by a factor of 2.7 on SS-CNN, LS-CNN and S-Transformer. Qualitative and quantitative analyses demonstrate the effectiveness and efficiency of the proposed ADA method.
computer science, artificial intelligence
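The two selection strategies named in the abstract, RM and RS, can be illustrated with a minimal sketch. This is not the authors' implementation: plain k-means stands in for the paper's deep clustering, the embeddings are synthetic, and the function names (`kmeans`, `select_rm`, `select_rs`) and the parameters `keep_ratio` and `per_cluster` are hypothetical choices made for illustration only.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means; an assumed stand-in for the paper's deep clustering."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers


def select_rm(x, labels, centers, keep_ratio=0.8):
    """RM-style selection: per cluster, drop the samples farthest from the center.

    The keep_ratio cut-off is an assumption; the abstract does not state one.
    """
    keep = []
    for j in range(len(centers)):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        d = np.linalg.norm(x[idx] - centers[j], axis=1)
        n_keep = max(1, int(np.ceil(keep_ratio * len(idx))))
        keep.extend(idx[np.argsort(d)[:n_keep]].tolist())
    return np.array(sorted(keep))


def select_rs(labels, k, per_cluster, seed=0):
    """RS-style selection: sample uniformly within each cluster to keep diversity."""
    rng = np.random.default_rng(seed)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        n = min(per_cluster, len(idx))
        if n > 0:
            keep.extend(rng.choice(idx, size=n, replace=False).tolist())
    return np.array(sorted(keep))


# Synthetic "embeddings" of augmented clips: two well-separated blobs.
data_rng = np.random.default_rng(0)
embeddings = np.vstack([
    data_rng.normal(0.0, 1.0, (50, 8)),
    data_rng.normal(6.0, 1.0, (50, 8)),
])

labels, centers = kmeans(embeddings, k=2)
rm_idx = select_rm(embeddings, labels, centers, keep_ratio=0.8)
rs_idx = select_rs(labels, k=2, per_cluster=10)
print(len(rm_idx), len(rs_idx))
```

In this sketch both functions return indices into the pool of augmented samples, so the selected subset could be passed on to training in place of the full augmented set. How the paper combines selection with training is not specified in the abstract.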