Improving child speech recognition with augmented child-like speech

Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg
2024-06-12
Abstract: State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer model and FT-Whisper model which reduced WERs with ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of "high-quality" VC-generated data achieved similar results to those of our best-FT models.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the poor performance of child speech recognition (CSR). Existing automatic speech recognition (ASR) systems perform worse on child speech than on adult and adolescent speech. The main reasons include:

1. **Data scarcity**: Far less child speech data is available than adult speech data, which limits the training and development of CSR models.
2. **High pronunciation variability**: Children's pronunciation varies widely, and differences across stages of language development make child speech harder to recognize.
3. **Low-resource problem**: Without sufficient training data, deep-learning models perform poorly on child speech.

To address these problems, the paper studies two ways of generating new child speech data to improve CSR systems:

- **Monolingual child-to-child voice conversion (VC)**: Apply VC among the child speakers already present in the dataset to generate new child speech.
- **Cross-lingual child-to-child voice conversion (VC)**: Apply VC with additional (new) child speakers from another language (Dutch-to-German), thereby introducing speech patterns from a different language background.

Using these methods, the paper examines how the quantity and quality of the generated child speech data affect CSR performance and evaluates how effective each method is.

### Key contributions

1. **First exploration and comparison of monolingual and cross-lingual child-to-child voice conversion**: This is the first time these two methods have been applied to data augmentation in child speech recognition.
2. **Evaluation of the impact of the quantity and quality of generated data on CSR performance**: Two-fold augmentation with cross-lingual VC-generated data gave the best results for the fine-tuned Conformer and Whisper models (about 3% absolute WER reduction compared to the baseline), six-fold augmentation gave the best result for the model trained from scratch (3.6% absolute WER reduction), and a small amount of high-quality generated data achieved similar results to the best fine-tuned models.
3. **Improvement of the baseline model for child speech recognition**: Adding the generated child speech data significantly reduces the word error rate (WER) and so improves recognition accuracy.

In conclusion, the paper shows that child-to-child voice conversion, particularly the cross-lingual variant, mitigates data scarcity and improves child speech recognition performance.
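
The results above are reported as absolute WER reductions. As a minimal sketch of what that metric means (not the authors' evaluation code), the following Python computes WER as word-level edit distance divided by the reference length, and contrasts absolute with relative reduction. The function names and the example sentences and numbers are illustrative assumptions, not data from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative (made-up) transcripts, not taken from the paper.
reference = "a b c d"
baseline_hyp = "a x c"    # one substitution, one deletion
augmented_hyp = "a b c"   # one deletion

baseline = wer(reference, baseline_hyp)
augmented = wer(reference, augmented_hyp)
print(baseline, augmented)
print(absolute_reduction(baseline, augmented))
print(relative_reduction(baseline, augmented))
```

An absolute reduction subtracts the two error rates directly, so a drop quoted as "3% absolute" refers to percentage points of WER, not to a 3% fraction of the baseline error.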