Abstract: Dysarthria is a neurological condition in which impaired control of the muscles involved in speech articulation reduces intelligibility, sometimes to the point of unintelligible speech, and so limits a person's ability to communicate. Although Automatic Speech Recognition (ASR) technologies could significantly improve the lives of people with dysarthria, ASR systems designed for typical speech have shown limited effectiveness on impaired speech. Consequently, researchers have focused on developing ASR systems specifically tailored for dysarthria. Progress in this area has been gradual for two reasons: dysarthric speech for training is scarce, and speech varies more among dysarthric individuals, which in turn calls for a larger dataset of dysarthric utterances. One potential way to make dysarthric ASR more robust is to deepen the architecture of the acoustic model, which maps the speech signal to words or phonetic units. However, deeper architectures require more training data and are prone to the vanishing gradient problem and to representational bottlenecks. In this study, we expanded on our previous findings and investigated the use of depthwise separable convolutions and residual connections to propose a deep dysarthric acoustic model that tackles both the vanishing gradient and representational bottleneck issues in dysarthric ASR. Multiple speaker-adaptive dysarthric ASRs were trained and evaluated for 15 UA-Speech dysarthric subjects, then benchmarked against the state-of-the-art and our previous dysarthric ASRs. Our proposed architectures delivered word recognition rate (WRR) improvements of up to 22.58% over the reference models. Across all speakers, WRRs were on average 10.81% better than those of the base traditional dysarthric ASR. Likewise, the proposed acoustic model outperformed the state-of-the-art Transformer-based dysarthric ASR reference model for all subjects with mild dysarthria, and achieved up to 14.26% better WRR for subjects with moderate dysarthria. Our findings indicate the importance of architecture optimization, not only to deal with vanishing gradients and representational bottlenecks but also to maintain the depth of the acoustic model so that sufficient model capacity is available to learn the intraspeaker variability caused by dysarthria.
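
The two architectural ingredients named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' acoustic model: the function names, the 1-D layout (a list of channels, each a list of time samples), the ReLU placement, and the layer sizes are all assumptions chosen for illustration. The sketch shows why a depthwise separable convolution needs fewer weights than a standard convolution of the same shape, and how a residual connection adds the block's input back to its output so that a skip path exists for the signal and its gradient.

```python
def standard_conv_params(c_in, c_out, k):
    """Weight count of a standard 1-D convolution (bias omitted)."""
    return c_in * c_out * k


def separable_conv_params(c_in, c_out, k):
    """Weight count of a depthwise (one k-tap filter per input channel)
    plus pointwise (1x1 channel-mixing) convolution, bias omitted."""
    return c_in * k + c_in * c_out


def depthwise_conv1d(x, dw):
    """Filter each channel with its own odd-length kernel, zero-padded
    so the output has the same length as the input."""
    out = []
    for chan, kernel in zip(x, dw):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(chan) + [0.0] * pad
        out.append([
            sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
            for t in range(len(chan))
        ])
    return out


def pointwise_conv1d(x, pw):
    """Mix channels at each time step; pw[o][c] weights input channel c
    in output channel o."""
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in pw
    ]


def residual_separable_block(x, dw, pw):
    """Hypothetical block: ReLU(pointwise(depthwise(x))) + x.
    The skip connection requires as many output channels as input channels."""
    y = pointwise_conv1d(depthwise_conv1d(x, dw), pw)
    return [
        [max(0.0, v) + s for v, s in zip(y_c, x_c)]
        for y_c, x_c in zip(y, x)
    ]
```

For an assumed layer with 64 input channels, 128 output channels, and a kernel width of 3, `standard_conv_params(64, 128, 3)` gives 24576 weights while `separable_conv_params(64, 128, 3)` gives 8384. When every kernel weight in the block is zero, `residual_separable_block` returns its input unchanged, which is the identity path that residual connections provide in deep networks.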

Enhancement of Dysarthric Speech Reconstruction by Contrastive Learning

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation

A Sequential Contrastive Learning Framework for Robust Dysarthric Speech Recognition

Dysarthric speech recognition: an investigation on using depthwise separable convolutions and residual connections

Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning

Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation

Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation

A Strategic Approach for Robust Dysarthric Speech Recognition

Accurate synthesis of Dysarthric Speech for ASR data augmentation

Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation

Improving the Efficiency of Dysarthria Voice Conversion System Based on Data Augmentation

Towards reconstructing intelligible speech from the human auditory cortex

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders

Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis

Multi-Task Transformer with Input Feature Reconstruction for Dysarthric Speech Recognition

Creating Personalized Synthetic Voices from Articulation Impaired Speech Using Augmented Reconstruction Loss

Advancing Voice Biometrics for Dysarthria Speakers Using Multitaper LFCC and Voice Conversion Data Augmentation

Quantifying and Improving the Performance of Speech Recognition Systems on Dysphonic Speech