Adaptation of Whisper models to child speech recognition

Rishabh Jain, Andrei Barcovschi, Mariam Yiwere, Peter Corcoran, Horia Cucu
2023-07-24
Abstract: Automatic Speech Recognition (ASR) systems often struggle with transcribing child speech due to the lack of large child speech datasets required to accurately train child-friendly ASR models. However, there are huge amounts of annotated adult speech datasets which were used to create multilingual ASR models, such as Whisper. Our work aims to explore whether such models can be adapted to child speech to improve ASR for children. In addition, we compare Whisper child-adaptations with finetuned self-supervised models, such as wav2vec2. We demonstrate that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech, compared to non-finetuned Whisper models. Additionally, utilizing self-supervised Wav2vec2 models that have been finetuned on child speech outperforms Whisper finetuning.
Audio and Speech Processing, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the poor performance of Automatic Speech Recognition (ASR) systems on children's speech. Because large-scale children's speech datasets are scarce, current ASR systems often transcribe children's speech badly. To address this, the researchers adapt an existing large-scale multilingual ASR model, Whisper, to children's speech through fine-tuning, and compare it with the self-supervised model wav2vec2 on the same task.

Whisper was originally trained on a vast amount of adult speech data and handles multiple languages. Without fine-tuning, it did not perform well on children's speech. After fine-tuning on children's speech data, its recognition performance improved significantly, especially when more children's speech data was used for fine-tuning.

wav2vec2 is a self-supervised method for speech representation learning: the model is pre-trained on unlabelled speech data and then fine-tuned on a small labeled dataset for a specific downstream task. Experimental results show that wav2vec2 also performs well on children's speech recognition and, in some cases, even outperforms the fine-tuned Whisper model.

Overall, the paper demonstrates that fine-tuning large pre-trained models, such as Whisper or wav2vec2, can effectively improve ASR performance on children's speech. It also discusses the impact of different fine-tuning strategies, model size, and training data volume on final recognition performance.
These findings are significant for the development of more robust and universal children's speech recognition systems.
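The summary above speaks of "recognition performance" without naming a metric. ASR comparisons of this kind are commonly reported as word error rate (WER), so the following is a minimal sketch under that assumption, not the authors' evaluation code. The function name `word_error_rate` and the example transcripts are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simple normalization (assumption): lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts for illustration only (not data from the paper).
reference = "the cat sat on the mat"
before_finetuning = "the cat sit on mat"      # one substitution, one deletion
after_finetuning = "the cat sat on the mat"   # exact match

wer_before = word_error_rate(reference, before_finetuning)
wer_after = word_error_rate(reference, after_finetuning)
print(f"WER before: {wer_before:.3f}, after: {wer_after:.3f}")
```

Under this metric, a fine-tuned model improves on its baseline when its WER on the same child-speech test set is lower; a real evaluation would also apply consistent text normalization (punctuation, numerals) to both reference and hypothesis before scoring.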