Abstract: Automatic Speech Recognition (ASR) systems often struggle to transcribe child speech because the large child speech datasets needed to train child-friendly ASR models are lacking. However, large amounts of annotated adult speech data exist and have been used to create multilingual ASR models such as Whisper. Our work explores whether such models can be adapted to child speech to improve ASR for children. In addition, we compare Whisper child-adaptations with finetuned self-supervised models, such as wav2vec2. We demonstrate that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech compared to non-finetuned Whisper models. Additionally, self-supervised wav2vec2 models that have been finetuned on child speech outperform Whisper finetuning.

What problem does this paper attempt to address?

The paper examines why Automatic Speech Recognition (ASR) systems struggle with children's speech and how that can be remedied. Because large-scale children's speech datasets are lacking, current ASR systems often perform poorly when transcribing children's speech. To address this, the authors adapt existing large-scale multilingual ASR models, such as Whisper, to children's speech through fine-tuning, and compare Whisper with the self-supervised model wav2vec2 on the same task.

Whisper was originally trained on a vast amount of adult speech data and can handle multiple languages. The authors found that Whisper without fine-tuning did not perform well on children's speech. After fine-tuning on children's speech data, its recognition performance improved significantly, especially when more children's speech data was used for fine-tuning.

The paper also explores wav2vec2 for children's speech recognition. wav2vec2 is a self-supervised method for speech representation learning: the model is pre-trained on unlabeled speech data and then fine-tuned on a small labeled dataset for a specific downstream task. Experimental results show that wav2vec2 also performs well on children's speech recognition and, in some cases, even outperforms the fine-tuned Whisper model.

Overall, the paper demonstrates that fine-tuning large pre-trained models, such as Whisper or wav2vec2, can effectively improve ASR performance on children's speech. It also discusses the impact of different fine-tuning strategies and the role of model size and training data volume in the final recognition performance. These findings are significant for the development of more robust and universal children's speech recognition systems.
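The summary above speaks of ASR "performance" without naming a metric. A common choice for such comparisons is word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration under that assumption, not the paper's evaluation code; the function name `word_error_rate` and the example transcripts are hypothetical, and real evaluations typically apply more text normalization than lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, for illustration only (not data from the paper).
reference = "the cat sat on the mat"
baseline_output = "the cat sat on a mat"     # one substituted word
finetuned_output = "the cat sat on the mat"  # exact match

baseline_wer = word_error_rate(reference, baseline_output)
finetuned_wer = word_error_rate(reference, finetuned_output)
print(f"baseline WER:  {baseline_wer:.3f}")
print(f"finetuned WER: {finetuned_wer:.3f}")
```

Lower is better. Comparing a non-fine-tuned and a fine-tuned model on the same child speech test set with a metric like this is one way to quantify the kind of improvement the summary describes.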

Adaptation of Whisper models to child speech recognition

Exploring Native and Non-Native English Child Speech Recognition With Whisper

A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition

Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

Evaluation of state-of-the-art ASR Models in Child-Adult Interactions

Adapting OpenAI's Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets

Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Whispy: Adapting STT Whisper Models to Real-Time Environments

Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations

Data Efficient Child-Adult Speaker Diarization with Simulated Conversations

Improving child speech recognition with augmented child-like speech

Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

Whisper Finetuning on Nepali Language

Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization

Fine-tuning Whisper on Low-Resource Languages for Real-World Applications

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Sparsely Shared LoRA on Whisper for Child Speech Recognition