Abstract: Dysarthria is a neurological condition resulting from impairments affecting the muscle control involved in speech articulation, leading to reduced intelligibility or unintelligible speech, which affects communication abilities. Although Automatic Speech Recognition (ASR) technologies hold the potential to significantly improve the lives of people with dysarthria, ASR systems designed for normal speech have shown limited effectiveness when presented with impaired speech. Consequently, researchers have focused on developing ASR systems specifically tailored for dysarthria. However, progress in this area has been gradual due to the scarcity of dysarthric speech for training and the increased variability of speech among dysarthric individuals, which necessitates a larger dataset of dysarthric utterances. One potential solution to enhance the robustness of dysarthric ASR is to deepen the architecture of the acoustic model, which maps the speech signal to words or phonetic units. However, deeper architectures require more training data and are harder to train because of issues such as the vanishing gradient problem and representational bottlenecks in deep learning models. In this study, we expanded on our previous findings and investigated the application of Depthwise Separable Convolution neurons and the inclusion of Residual Connections to propose a deep dysarthric acoustic model that tackles both the vanishing gradient and representational bottleneck issues in dysarthric ASR. Multiple speaker-adaptive dysarthric ASRs were trained and evaluated for 15 UA-Speech dysarthric subjects, then benchmarked against the state-of-the-art and our previous dysarthric ASRs. Our proposed architectures delivered word recognition rate (WRR) improvements of up to 22.58% over the reference models. On average, WRRs were 10.81% better than those of the base traditional dysarthric ASR across all speakers. Likewise, the proposed acoustic model outperformed the state-of-the-art Transformer-based dysarthric ASR reference model for all subjects with mild dysarthria, and achieved up to 14.26% better WRR for the subjects with moderate dysarthria. Our findings indicate the importance of architecture optimization, not only to deal with vanishing gradients and representational bottlenecks but also to maintain the depth of the acoustic model so that sufficient model capacity is available to learn the intraspeaker variability caused by dysarthria.
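
As a rough illustration of the two building blocks named in the abstract, the sketch below compares the weight count of a standard convolution with that of a depthwise separable convolution, and shows a residual connection as an element-wise sum of a block's input and output. It is a minimal sketch, not the authors' architecture: the function names and the layer sizes (a 3 x 3 kernel with 64 input and 128 output channels) are assumptions chosen only for illustration, and bias terms are ignored.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard 2-D convolution (bias terms ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable convolution (bias terms ignored)."""
    # Depthwise step: one k x k filter per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


def residual(block, x):
    """Residual connection: add the block's input to its output element-wise."""
    return [xi + fi for xi, fi in zip(x, block(x))]


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 64, 128
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print("standard:", standard)
print("depthwise separable:", separable)
print("reduction factor:", round(standard / separable, 2))
```

For these assumed sizes the separable layer needs roughly an eighth of the weights, which is one reason such layers are attractive when training data is scarce. The identity path in `residual` gives gradients a direct route through a deep stack of blocks.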

Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

Pre-trained models for detection and severity level classification of dysarthria from speech

Exploring the Impact of Fine-Tuning the Wav2vec2 Model in Database-Independent Detection of Dysarthric Speech

Automated Dysarthria Severity Classification: A Study on Acoustic Features and Deep Learning Techniques

Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

Analyzing wav2vec embedding in Parkinson's disease speech: A study on cross-database classification and regression tasks

Voice Disorder Classification Using Wav2vec 2.0 Feature Extraction

Automatic dysarthria detection and severity level assessment using CWT-layered CNN model

Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context

Analyzing Wav2Vec 1.0 Embeddings for Cross-Database Parkinson's Disease Detection and Speech Features Extraction

Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0

Automatic Speech Disfluency Detection Using Wav2vec2.0 for Different Languages with Variable Lengths

A Multi-modal Approach to Dysarthria Detection and Severity Assessment Using Speech and Text Information

Variable STFT Layered CNN Model for Automated Dysarthria Detection and Severity Assessment Using Raw Speech

A deep learning approach to dysarthric utterance classification with BiLSTM-GRU, speech cue filtering, and log mel spectrograms

Dysarthric speech recognition: an investigation on using depthwise separable convolutions and residual connections

Enhancing dysarthria speech feature representation with empirical mode decomposition and Walsh-Hadamard transform

A New Method for Predicting Severity Level of Dysarthric Speech Based on Joint Feature-Sample Selection Using Audio-Visual Data

Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

Automatic classification of neurological voice disorders using wavelet scattering features