Abstract: Developing an automatic speaker verification (ASV) system for children's speech is a formidable task for a number of reasons, the dearth of domain-specific data being one of them. The challenge intensifies further with short utterances, a relatively unexplored domain in children's ASV. Voice-based biometric systems suffer severely when the speech data used for enrollment or verification is inadequate in both volume and duration. To circumvent the issues arising from data scarcity, this paper extensively explores in-domain as well as various out-of-domain data augmentation techniques, and proposes a data augmentation approach that encompasses both. The in-domain data augmentation incorporates speed perturbation of children's speech. The out-of-domain data come from adult speakers, whose acoustic attributes are known to be in stark contrast to those of child speakers. The acoustic characteristics of the adult speech data are therefore altered on two fronts, namely speech waveform modification and feature-level modification, to render them acoustically similar to children's speech prior to augmentation. The speech waveform modification involves signal processing techniques such as prosody modification, formant modification, and voice conversion. The feature-level modification, on the other hand, involves the vocal-tract length normalization (VTLN) technique, which explicitly models and compensates for the ill effects of variations in vocal tract length by linearly warping the frequency axis of speech signals. The proposed data augmentation approach not only increases the amount of training data but also effectively captures the missing target attributes, which boosts the verification performance.
A relative improvement of 48.01% in equal error rate (EER) with respect to the baseline system attests to this. Furthermore, the conventionally used Mel-frequency cepstral coefficients (MFCC) are known to average out the higher-frequency components. Prior works have shown that a significant amount of relevant acoustic information is available in the higher-frequency region of children's speech. Effective preservation of the higher-frequency contents is therefore of paramount importance for developing a reliable and robust children's ASV system. To this end, the MFCC features are concatenated at the frame level with the inverse-Mel-frequency cepstral coefficient (IMFCC) features, with the sole intention of preserving the higher-frequency contents of the children's speech data. The low canonical correlation between the MFCC and IMFCC feature vectors motivates their fusion. The feature concatenation approach, when combined with the proposed data augmentation, further improves the verification performance. The experimental results support these claims, with an overall relative reduction of 50.15% in equal error rate.
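
The frame-level concatenation of MFCC and IMFCC described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the IMFCC filterbank is obtained by flipping a triangular mel filterbank along the frequency axis, a common construction that places the narrow filters at high frequencies. The function names, the 16 kHz sampling rate, the 26 filters, and the 13 cepstral coefficients are all illustrative assumptions, since the abstract specifies none of them.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def inverse_mel_filterbank(n_filters, n_fft, sr):
    """Assumed IMFCC filterbank: the mel filterbank mirrored in frequency,
    so the narrow (high-resolution) filters sit in the high-frequency region."""
    return mel_filterbank(n_filters, n_fft, sr)[::-1, ::-1]

def dct2(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)  # (n_out, n)
    return x @ basis.T

def power_spectrogram(signal, n_fft=512, hop=160):
    """Hann-windowed short-time power spectrum, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def cepstra(power, fb, n_ceps):
    """Log filterbank energies followed by a DCT."""
    return dct2(np.log(power @ fb.T + 1e-10), n_ceps)

def mfcc_imfcc(signal, sr=16000, n_fft=512, hop=160, n_filters=26, n_ceps=13):
    """Concatenate MFCC and IMFCC frame by frame: (n_frames, 2 * n_ceps)."""
    power = power_spectrogram(signal, n_fft, hop)
    mfcc = cepstra(power, mel_filterbank(n_filters, n_fft, sr), n_ceps)
    imfcc = cepstra(power, inverse_mel_filterbank(n_filters, n_fft, sr), n_ceps)
    return np.concatenate([mfcc, imfcc], axis=1)
```

Calling `mfcc_imfcc(signal)` on a one-dimensional waveform array returns one fused vector per frame, with the MFCC coefficients in the first half of each row and the IMFCC coefficients in the second half. Both halves are computed from the same power spectrum, so the only difference between them is where the filterbank concentrates its frequency resolution.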

Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children's Speech

ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification

Role of Data Augmentation and Effective Conservation of High-Frequency Contents in the Context Children's Speaker Verification System

Experimental studies for improving the performance of children's speaker verification system using short utterances

PDAugment: Data Augmentation by Pitch and Duration Adjustments for Automatic Lyrics Transcription

Exploring the Role of Data Augmentation and Acoustic Feature Concatenation in the Context of Zero-Resource Children's ASR

Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition

LPC Augment: An LPC-Based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects

In domain training data augmentation on noise robust Punjabi Children speech recognition

Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Data augmentation using prosody and false starts to recognize non-native children's speech

PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition

Effective preservation of higher-frequency contents in the context of short utterance based children's speaker verification system

Using Data Augmentations and VTLN to Reduce Bias in Dutch End-to-End Speech Recognition Systems

Data Augmentation For Children's Speech Recognition -- The "Ethiopian" System For The SLT 2021 Children Speech Recognition Challenge

Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Data Augmentation for End-to-end Code-switching Speech Recognition

Improving child speech recognition with augmented child-like speech

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis