Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder

Jihyun Mun, Sunhee Kim, Minhwa Chung
2024-08-30
Abstract: Autism Spectrum Disorder (ASD) is a lifelong condition that significantly influences an individual's communication abilities and social interactions. Early diagnosis and intervention are critical due to the profound impact of ASD's characteristic behaviors on foundational developmental stages. However, limitations of standardized diagnostic tools necessitate the development of objective and precise diagnostic methodologies. This paper proposes an end-to-end framework for automatically predicting the social communication severity of children with ASD from raw speech data. The framework incorporates an automatic speech recognition model, fine-tuned with speech data from children with ASD, followed by fine-tuned pre-trained language models that generate a final prediction score. Achieving a Pearson Correlation Coefficient of 0.6566 with human-rated scores, the proposed method shows potential as an accessible and objective tool for the assessment of ASD.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of assessing social communication severity in children with Autism Spectrum Disorder (ASD). Specifically, it proposes an end-to-end framework that automatically predicts a child's social communication severity score from raw speech data. The framework combines Automatic Speech Recognition (ASR) models with Pre-trained Language Models (PLMs) and fine-tunes both to generate the final predicted score.

### Background and Problem

ASD is a lifelong condition that severely affects an individual's communication abilities and social interactions. Early diagnosis and intervention during the foundational developmental stages are crucial. However, existing standardized diagnostic tools have several limitations, including a scarcity of professionals, subjective bias, and lengthy assessment processes. Objective and accurate diagnostic methods are therefore needed.

### Solution

The paper proposes an end-to-end framework for automatic assessment, built from the following steps:

1. **Automatic Speech Recognition (ASR) Model**: Select and fine-tune two multilingual ASR models (wav2vec2 and Whisper) to adapt them to the speech characteristics of ASD children and typically developing (TD) children.
2. **Pre-trained Language Model (PLM)**: Fine-tune three PLMs (KR-BERT, KLUE/roberta-base, and KR-ELECTRA-Discriminator) using three methods: traditional fine-tuning, manual prompting, and P-tuning.
3. **Ensemble Method**: Use a seed ensemble to aggregate the predictions of multiple fine-tuned models, improving the robustness and accuracy of the predictions.

### Experiments and Results

- **Data Preparation**: Speech data were collected from 168 ASD children and 40 TD children and used to fine-tune the ASR models and PLMs.
- **Experimental Setup**: Two settings were evaluated: a full-dataset setting that uses all available training data, and a low-resource setting that uses 20% of the training data.
- **Evaluation Metrics**: The Pearson Correlation Coefficient (PCC) measures the relationship between the model's predicted scores and human-annotated scores.

The experimental results show that the proposed framework predicts the social communication severity of ASD children well (the abstract reports a PCC of 0.6566 with human-rated scores), including in data-limited scenarios. Notably, in the low-resource setting, certain combinations (such as the KLUE/roberta-base model with P-tuning) scored higher with ASR transcripts than with human transcripts.

### Discussion

- **ASR vs. Human Transcription**: In the low-resource setting, ASR transcription performed close to, or even better than, human transcription, demonstrating its potential in resource-limited situations.
- **ASR Model Selection**: Although the Whisper model had a lower error rate, the wav2vec2 model was better at capturing ASD-related speech features.
- **PLM and Tuning Methods**: The choice of PLM and tuning method significantly affected performance, with P-tuning showing outstanding results in certain cases.

### Conclusion

The paper proposes an end-to-end framework that fine-tunes ASR models and PLMs to automatically predict the social communication severity of ASD children from raw speech data. The experimental results indicate that the framework maintains high prediction accuracy even in data-limited scenarios, offering a new tool for early diagnosis and intervention in ASD. Future research will focus on improving the interpretability of the models to ensure their reliability and transparency in clinical applications.
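
To make the ensemble and evaluation steps described above more concrete, here is a minimal Python sketch of a seed ensemble followed by a PCC computation. It is not the authors' code. The summary only says that predictions from multiple fine-tuned models are aggregated, so averaging across seeds is an assumption made here, and the function names and the toy scores are hypothetical.

```python
import numpy as np


def seed_ensemble(predictions_per_seed):
    """Average per-child predictions across runs that differ only in random seed.

    Assumption: the ensemble aggregates by a plain mean; the summary does not
    state the exact aggregation rule.
    """
    stacked = np.asarray(predictions_per_seed, dtype=float)  # (n_seeds, n_children)
    return stacked.mean(axis=0)


def pearson_correlation(predicted, reference):
    """Pearson Correlation Coefficient between two equal-length score vectors."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    pred_centered = predicted - predicted.mean()
    ref_centered = reference - reference.mean()
    # Covariance term divided by the product of the two standard-deviation terms.
    denominator = np.sqrt((pred_centered ** 2).sum() * (ref_centered ** 2).sum())
    return float((pred_centered * ref_centered).sum() / denominator)


# Hypothetical toy data: human-rated severity scores for five children and
# predictions from three fine-tuned models trained with different seeds.
human_scores = [2.0, 5.0, 7.0, 3.0, 8.0]
predictions_per_seed = [
    [2.5, 4.0, 6.5, 3.5, 7.0],
    [1.5, 5.5, 7.5, 2.0, 8.5],
    [3.0, 4.5, 6.0, 4.0, 7.5],
]

ensemble_scores = seed_ensemble(predictions_per_seed)
pcc = pearson_correlation(ensemble_scores, human_scores)
print(f"Ensemble predictions: {np.round(ensemble_scores, 3)}")
print(f"PCC against human scores: {pcc:.4f}")
```

Averaging over seeds reduces the run-to-run variance that fine-tuning on a small dataset tends to produce, which is one plausible reason an ensemble would make predictions more robust. The PCC is scale-invariant, so it rewards predictions that rank and space children like the human raters do, even if the absolute scores are offset.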