Exploring the Role of Data Augmentation and Acoustic Feature Concatenation in the Context of Zero-Resource Children's ASR
Ankita, S. Shahnawazuddin
DOI: https://doi.org/10.1007/s00034-024-02896-8
IF: 2.311
2024-11-02
Circuits Systems and Signal Processing
Abstract: This work studies the impact of out-of-domain data augmentation and front-end acoustic feature concatenation on the zero-resource children's automatic speech recognition (ASR) task. In such cases, ASR systems are usually trained on adults' speech. However, due to the acoustic mismatch between training and test data, recognition performance degrades significantly. As highlighted by earlier works, differences in fundamental and formant frequencies as well as speaking rates are the major factors behind this acoustic mismatch. To enhance recognition performance, a two-stage approach is proposed in this paper. In the first stage, an out-of-domain data augmentation technique is applied that alters adults' speech to sound like children's speech. This involves adjusting prosody, modifying formants, and employing voice conversion. Pooling these modified versions increases the amount of training data fourfold. At the same time, a wider array of targeted acoustic features is introduced into the training set with the goal of enhancing recognition performance. In the second stage, we fuse two front-end acoustic feature vectors at the frame level to capture a broader spectrum of acoustic details. We use TANDEM-STRAIGHT Mel-frequency cepstral coefficients (TS-MFCC) as the first front-end feature. To derive the second feature, the Mel filterbank employed for TS-MFCC is replaced with a Gammatone filterbank. These features are therefore termed TS-GTF-CC in this paper. Canonical correlation analysis performed on TS-MFCC and TS-GTF-CC revealed that certain coefficients of the two feature vectors exhibit low correlation. This, in turn, implies that they are likely to capture acoustic information differently. This observation motivated us to concatenate the two feature vectors.
By combining the out-of-domain data augmentation strategy with frame-level concatenation of TS-MFCC and TS-GTF-CC features, we achieved a substantial relative reduction in word error rate.
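The frame-level fusion described in the abstract can be illustrated with a minimal sketch. It assumes both feature streams are extracted with the same frame shift, so that frame i of one stream lines up with frame i of the other. The function name `concatenate_frames`, the variable names, and the toy values are hypothetical and are not taken from the paper; real TS-MFCC and TS-GTF-CC extraction is not shown.

```python
from typing import List, Sequence

Frame = List[float]


def concatenate_frames(
    first: Sequence[Sequence[float]],
    second: Sequence[Sequence[float]],
) -> List[Frame]:
    """Join two per-frame feature sequences frame by frame.

    Each input is a sequence of frames; each frame is a sequence of
    coefficients. Frame i of the output is frame i of `first`
    followed by frame i of `second`.
    """
    if len(first) != len(second):
        raise ValueError("both feature streams must have the same number of frames")
    return [list(a) + list(b) for a, b in zip(first, second)]


# Toy data (hypothetical values): 3 frames, 2 coefficients per stream.
ts_mfcc = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ts_gtfcc = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]

fused = concatenate_frames(ts_mfcc, ts_gtfcc)
print(len(fused), len(fused[0]))  # 3 4
```

The fused vector keeps the number of frames unchanged and has a dimensionality equal to the sum of the two input dimensionalities, which is what lets the acoustic model see both representations of each frame at once.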
engineering, electrical & electronic