Abstract: The ability to reliably transcribe child-adult conversations in a clinical setting is valuable for diagnosis and understanding of numerous developmental disorders such as Autism Spectrum Disorder. Recent advances in deep learning architectures and the availability of large-scale transcribed data have led to the development of speech foundation models that have shown dramatic improvements in ASR performance. However, the ability of these models to translate well to conversational child-adult interactions is understudied. In this work, we provide a comprehensive evaluation of ASR performance on a dataset containing child-adult interactions from autism diagnostic sessions, using Whisper, Wav2Vec2, HuBERT, and WavLM. We find that speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting. Then, we employ LoRA on the best-performing zero-shot model (whisper-large) to probe the effectiveness of fine-tuning in a low-resource setting, resulting in ~8% absolute WER improvement for child speech and ~13% absolute WER improvement for adult speech.

What problem does this paper attempt to address?

This paper addresses the problem of reliably transcribing conversations between children and adults in clinical settings, particularly during diagnostic sessions for Autism Spectrum Disorder (ASD). Recent progress in deep learning architectures and the availability of large-scale transcribed audio data have substantially improved Automatic Speech Recognition (ASR) performance, but how well these models handle child-adult conversations has not been studied in depth.

Specifically, the paper points out that existing ASR systems perform poorly on children's speech, especially in natural conversation. This is mainly because children's speech differs from adults' speech in pitch, linguistic and acoustic features, and pronunciation ability. In addition, language and communication abnormalities associated with neurodevelopmental disorders such as ASD make it harder still to develop child-inclusive ASR systems.

To evaluate current state-of-the-art ASR models on child-adult conversations, the paper selects four representative models (Whisper, Wav2Vec2, HuBERT, and WavLM) and evaluates them on a dataset of child-adult interactions from autism diagnostic sessions. The study finds that the Word Error Rate (WER) for child speech is 15-20% (absolute) worse than for adult speech. The paper then applies Low-Rank Adaptation (LoRA) to the best zero-shot model (whisper-large) to probe the effectiveness of fine-tuning in a low-resource setting. This fine-tuning improves absolute WER by about 8% for child speech and about 13% for adult speech.

In conclusion, this paper aims to fill the gap in existing research on ASR performance evaluation for child-adult conversations, and to explore whether fine-tuning can improve ASR performance on children's speech, in support of early diagnosis and intervention for developmental disorders such as autism.
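
Because the summary reports its results as absolute WER differences, it may help to see how WER is computed and how an absolute gap differs from a relative one. The following is a minimal sketch, not the paper's evaluation code: the function names are hypothetical, the text normalization (lowercasing and whitespace splitting) is an assumption, and the example WER values are illustrative rather than taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only two rows of the table.
    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_gap_points(wer_a: float, wer_b: float) -> float:
    """Difference between two WERs in percentage points (the 'absolute' gap)."""
    return (wer_a - wer_b) * 100


def relative_gap_percent(wer_a: float, wer_b: float) -> float:
    """Difference between two WERs as a percentage of wer_b (the 'relative' gap)."""
    return (wer_a - wer_b) / wer_b * 100


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))

    # Illustrative values only, NOT results from the paper.
    child_wer, adult_wer = 0.35, 0.20
    print(absolute_gap_points(child_wer, adult_wer))
    print(relative_gap_percent(child_wer, adult_wer))
```

With the illustrative values above, the absolute gap is about 15 percentage points while the relative gap is about 75%, which is why it matters that the reported 15-20% drop and the ~8% and ~13% improvements are stated as absolute figures.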

Evaluation of state-of-the-art ASR Models in Child-Adult Interactions

Data Efficient Child-Adult Speaker Diarization with Simulated Conversations

Adaptation of Whisper models to child speech recognition

Improving child speech recognition with augmented child-like speech

Exploring Native and Non-Native English Child Speech Recognition With Whisper

Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition

An Investigation on Applying Acoustic Feature Conversion to ASR of Adult and Child Speech

Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism

Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models

Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Child-adult speech diarization in naturalistic conditions of preschool classrooms using room-independent ResNet model and automatic speech recognition-based re-segmentation

Adapting an ASR Foundation Model for Spoken Language Assessment

Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research

Prosody Usage Optimization for Children Speech Recognition with Zero Resource Children Speech

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions

Towards a Single ASR Model That Generalizes to Disordered Speech

ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification

Child Speech Recognition in Human-Robot Interaction: Problem Solved?

The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models

Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward