Prosody features based low resource Punjabi children ASR and T-NT classifier using data augmentation

Virender Kadyan, Taniya Hasija, Amitoj Singh
DOI: https://doi.org/10.1007/s11042-022-13435-5
IF: 2.577
2022-07-20
Multimedia Tools and Applications
Abstract: Automatic recognition of children's speech is challenging because of limited corpora and varying acoustic features. In particular, the lack of an available speech corpus and large acoustic variability limit what can be learned from the training dataset. To address this, two types of systems are built: an ASR system and a Tonal-Non tonal (T-NT) classifier. First, robust features are added at the front end using prosody-embedded feature vectors. Various prosody features are combined with Mel Frequency Cepstral Coefficient (MFCC) feature vectors, and the combination outperforms conventional MFCC features alone. A small reduction in Word Error Rate (WER) is obtained on the original train and test datasets. To further improve the recognition rate, training data scarcity is reduced through a two-level augmentation approach: external prosody modification (using pitch and time scaling parameters) and internal augmentation using speed perturbation (3-, 4-, and 5-way methods). The original and augmented datasets are pooled so that the models can learn more statistical parameter information. A significant improvement in the performance of both systems is observed due to the two-level augmentation and the prosody-embedded features. Finally, relative improvements of 13.1% and 18.3% over the baseline system are achieved for the ASR and T-NT classifier systems, respectively, when trained on the modified training set and evaluated on the original test set.
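The speed perturbation step and the relative improvement figures mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the factors 0.9/1.0/1.1 are a commonly used 3-way setting assumed here (the abstract does not state the factors for the 3-, 4-, or 5-way methods), and linear interpolation stands in for a proper audio resampler.

```python
import numpy as np


def speed_perturb(signal, factor):
    """Resample a 1-D waveform so it plays `factor` times faster.

    Assumption: plain linear interpolation is used as a stand-in for a
    real resampler. Duration and pitch both change, as with ordinary
    speed perturbation.
    """
    n_out = int(round(len(signal) / factor))
    # Positions in the original signal at which the output is sampled.
    src_positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(signal)), signal)


def augment(signal, factors=(0.9, 1.0, 1.1)):
    """Return one perturbed copy per factor (assumed 3-way setting)."""
    return [speed_perturb(signal, f) for f in factors]


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction over a baseline, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

Under these assumptions, a 4- or 5-way variant would simply pass a longer tuple of factors to `augment`, and the perturbed copies would be pooled with the original utterances before training.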
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering