Evaluation of state-of-the-art ASR Models in Child-Adult Interactions

Aditya Ashvin, Rimita Lahiri, Aditya Kommineni, Somer Bishop, Catherine Lord, Sudarsana Reddy Kadiri, Shrikanth Narayanan
2024-09-24
Abstract: The ability to reliably transcribe child-adult conversations in a clinical setting is valuable for diagnosis and understanding of numerous developmental disorders such as Autism Spectrum Disorder. Recent advances in deep learning architectures and the availability of large-scale transcribed data have led to the development of speech foundation models that have shown dramatic improvements in ASR performance. However, the ability of these models to translate well to conversational child-adult interactions is understudied. In this work, we provide a comprehensive evaluation of ASR performance on a dataset containing child-adult interactions from autism diagnostic sessions, using Whisper, Wav2Vec2, HuBERT, and WavLM. We find that speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting. Then, we employ LoRA on the best-performing zero-shot model (whisper-large) to probe the effectiveness of fine-tuning in a low-resource setting, resulting in ~8% absolute WER improvement for child speech and ~13% absolute WER improvement for adult speech.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses the problem of reliably transcribing conversations between children and adults in a clinical environment, especially during diagnostic sessions for Autism Spectrum Disorder (ASD). Recent progress in deep-learning architectures and the availability of large-scale transcribed audio data have significantly improved Automatic Speech Recognition (ASR) performance, but how well these models handle child-adult conversations has not been fully studied. Specifically, the paper points out that existing ASR systems perform poorly on children's speech, especially in natural conversation. This is mainly because children's speech differs substantially from adults' speech in pitch, linguistic and acoustic features, and pronunciation ability. In addition, language and communication atypicalities associated with neurodevelopmental disorders such as ASD make it even harder to develop child-inclusive ASR systems. To evaluate current state-of-the-art ASR models on child-adult conversations, the paper selects four representative models (Whisper, Wav2Vec2, HuBERT, and WavLM) and evaluates them on a dataset of child-adult interactions from autism diagnostic sessions. The study finds that Word Error Rate (WER) on child speech is 15-20% absolute higher than on adult speech in this conversational setting. The paper then explores how far the best zero-shot model (whisper-large) can be improved with Low-Rank Adaptation (LoRA) in a low-resource setting. The results show that this fine-tuning method reduces WER by about 8% absolute for child speech and about 13% absolute for adult speech.
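The WER figures above are absolute differences, that is, differences in percentage points rather than relative changes. The sketch below shows one common way to compute WER (word-level edit distance divided by the number of reference words) and an absolute gap between two speaker groups. It is a minimal illustration only: the function names and the toy transcripts are hypothetical, and the paper's actual scoring pipeline and text normalization are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_gap_points(wer_a: float, wer_b: float) -> float:
    """Absolute WER difference in percentage points (wer_a minus wer_b)."""
    return (wer_a - wer_b) * 100.0


# Toy transcripts, invented for illustration; not taken from the paper's data.
child_wer = word_error_rate("i want the red ball", "i want red bull")
adult_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
gap = absolute_gap_points(child_wer, adult_wer)
```

With these toy inputs the child utterance has two errors in five reference words and the adult utterance one error in six, so `gap` is the difference between those two rates expressed in percentage points.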
In conclusion, this paper aims to fill a gap in existing research by evaluating ASR performance on child-adult conversations, and it explores fine-tuning as a way to improve ASR performance on children's speech, in support of early diagnosis of and intervention for developmental disorders such as autism.
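To give a sense of why LoRA suits a low-resource setting, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original weight of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in, so it adds r * (d_in + d_out) trainable parameters. The dimensions and rank used here are assumptions chosen for illustration; the rank and target layers used in the paper are not stated in this summary.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters updated when a d_out x d_in weight matrix is fully fine-tuned."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one matrix: B (d_out x r) plus A (r x d_in)."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return rank * (d_in + d_out)


# Illustrative values only: a square 1280 x 1280 projection and rank 8 are assumptions.
d_model = 1280
rank = 8

full = full_matrix_params(d_model, d_model)
lora = lora_trainable_params(d_model, d_model, rank)
trainable_fraction = lora / full
```

Under these assumed values, the low-rank factors amount to a small fraction of the matrix's full parameter count, which is what makes adapting a large model with limited data and compute practical.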