Abstract: Dysarthria is a neurological condition caused by impaired control of the muscles involved in speech articulation. It reduces intelligibility, sometimes to the point of unintelligible speech, and so limits a person's ability to communicate. Although Automatic Speech Recognition (ASR) technologies could significantly improve the lives of people with dysarthria, ASR systems designed for normal speech perform poorly on impaired speech. Consequently, researchers have focused on developing ASR systems tailored specifically to dysarthria. Progress has been gradual, however, because dysarthric speech for training is scarce and because speech varies more among dysarthric individuals, which increases the amount of dysarthric data needed. One potential way to make dysarthric ASR more robust is to deepen the acoustic model, which maps the speech signal to words or phonetic units. However, deeper architectures require more training data and are prone to vanishing gradients and representational bottlenecks. In this study, we expanded on our previous findings and investigated depthwise separable convolutions and residual connections to build a deep dysarthric acoustic model that addresses both the vanishing gradient and representational bottleneck issues in dysarthric ASR. Multiple speaker-adaptive dysarthric ASRs were trained and evaluated for 15 UA-Speech dysarthric subjects, then benchmarked against the state-of-the-art and our previous dysarthric ASRs. The proposed architectures delivered word recognition rate (WRR) improvements of up to 22.58% over the reference models, and an average WRR improvement of 10.81% over the base traditional dysarthric ASR across all speakers. Likewise, the proposed acoustic model outperformed the state-of-the-art Transformer-based dysarthric ASR reference model for all subjects with mild dysarthria and achieved up to 14.26% better WRR for subjects with moderate dysarthria. Our findings indicate the importance of architecture optimization, not only to deal with vanishing gradients and representational bottlenecks but also to maintain the depth of the acoustic model so that it has enough capacity to learn the intraspeaker variability caused by dysarthria.
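
The two building blocks named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' architecture: the abstract gives no layer shapes, so the 1D setting, the function names, and the channel and kernel sizes below are all assumptions chosen for illustration. The sketch shows why a depthwise separable convolution needs far fewer weights than a standard convolution of the same shape, and how a residual connection adds the block input back to its output so that an identity path is always available through a deep stack.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard 1D convolution with kernel length k (bias ignored)."""
    return k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Depthwise weights (one length-k kernel per input channel) plus
    pointwise weights (a 1x1 convolution mixing c_in channels into c_out)."""
    return k * c_in + c_in * c_out


def depthwise_conv1d(x, kernels):
    """Filter each channel with its own odd-length kernel, zero-padded so the
    output has the same length as the input.
    x: list of channels, each a list of floats; kernels: one kernel per channel."""
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(channel))
        ])
    return out


def pointwise_conv1d(x, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o
    independently at every time step."""
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in weights
    ]


def residual_separable_block(x, kernels, weights):
    """y = x + pointwise(depthwise(x)). The skip path requires the block to
    keep the channel count unchanged (c_out == c_in)."""
    fx = pointwise_conv1d(depthwise_conv1d(x, kernels), weights)
    return [[a + b for a, b in zip(xc, fc)] for xc, fc in zip(x, fx)]


if __name__ == "__main__":
    # Hypothetical layer shape: kernel length 3, 64 input and 128 output channels.
    print(standard_conv_params(3, 64, 128))   # 24576
    print(separable_conv_params(3, 64, 128))  # 8384
```

With all-zero kernels the block returns its input unchanged, which is the property that lets gradients bypass a layer in a deep stack. The lower weight count is what makes added depth more affordable when training data is scarce.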

Dysarthric speech recognition: an investigation on using depthwise separable convolutions and residual connections
