Abstract:Dysarthria is a neurological condition resulting from impairments affecting muscle control involved in speech articulation, leading to reduced intelligibility or unintelligible speech, which affects communication abilities. Although Automatic Speech Recognition (ASR) technologies hold the potential to improve the lives of people with dysarthria significantly, ASR systems designed for normal speech have shown limited effectiveness when presented with impaired speech. Consequently, researchers have focused on developing ASR systems specifically tailored for dysarthria. However, progress in this area has been gradual due to the scarcity of dysarthric speech for training and the increased variability of speech among dysarthric individuals, necessitating a larger dataset of dysarthric utterances. One potential solution to enhance the robustness of dysarthric ASR is to deepen the architecture of the acoustic model, which maps the speech signal to words or phonetic units. However, deeper architectures require more training data and pose challenges in dealing with issues such as the vanishing gradient problem and representational bottlenecks in deep learning models. In this study, we expanded on our previous findings and investigated the applications of Depthwise Separable Convolution neurons and the inclusion of Residual Connections to propose a deep dysarthric acoustic model, tackling both vanishing gradients and representational bottleneck issues in dysarthric ASR. Multiple speaker-adaptive dysarthric ASRs were trained and evaluated for 15 UA-Speech dysarthric subjects, then benchmarked against the state-of-the-art and our previous dysarthric ASRs. Our proposed architectures have delivered up to 22.58% word recognition rate (WRR) improvements over the reference models. We observed an average of 10.81% better WRRs over the base traditional dysarthric ASR for all speakers. Likewise, the proposed acoustic model outperformed the state-of-the-art Transformer-based dysarthric ASR reference model for all subjects with mild dysarthria, and up to 14.26% better WRR for the moderate dysarthric subjects was obtained. Our findings indicate the importance of architecture optimization to not only deal with vanishing gradient and representational bottleneck but also maintain the depth of the acoustic model to ensure sufficient model capacity is available to learn intraspeaker variability caused by dysarthria.

Multi-Modal Acoustic-Articulatory Feature Fusion For Dysarthric Speech Recognition

Exploring Pre-trained Speech Model for Articulatory Feature Extraction in Dysarthric Speech Using ASR

Articulatory Features for ASR of Pathological Speech

Dysarthric speech recognition: an investigation on using depthwise separable convolutions and residual connections

A survey of technologies for automatic Dysarthric speech recognition

Accent Recognition with Hybrid Phonetic Features

CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition.

Accurate synthesis of Dysarthric Speech for ASR data augmentation

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

Advancing Voice Biometrics for Dysarthria Speakers Using Multitaper LFCC and Voice Conversion Data Augmentation

A Strategic Approach for Robust Dysarthric Speech Recognition

Use of Speech Impairment Severity for Dysarthric Speech Recognition

Acoustic Model Fusion for End-to-end Speech Recognition

Dysarthria recognition combining speech fusion feature and random forest

Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition

Robust Speech Recognition Combining Cepstral and Articulatory Features

Deep neural network architectures for dysarthric speech analysis and recognition

Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition