Abstract: Dysarthria, a motor speech disorder that impairs articulation and speech clarity, presents significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes an approach to improve the accuracy of Dysarthric Speech Recognition (DSR). The primary innovation is the integration of the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), a generative adversarial network tailored for Dysarthric Speech Enhancement (DSE), as a front-end processing stage for DSR systems. S-SEGAN combines SEGAN's adversarial learning with SepFormer's speech separation capabilities. In addition, a multistage transfer learning approach is used to train and assess the DSR models for both word-level and sentence-level DSR: the models are first trained on a large speech dataset (LibriSpeech) and then fine-tuned on dysarthric speech data (both isolated and augmented). Evaluations show that integrating DSE improves DSR accuracy. The Dysarthric Speech (DS) baseline models (without DSE), Transformer and Conformer, achieved Word Recognition Accuracy (WRA) of 68.60% and 69.87%, respectively. Adding a Hierarchical Attention Network (HAN) to the Transformer and Conformer architectures improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. With DSE + DSR on isolated words, the Transformer model achieves a WRA of 73.40% and the Conformer model 74.33%. The T-HAN and C-HAN models with DSE + DSR improve further, with WRAs of 75.73% and 76.87%, respectively. Augmenting words boosts performance again, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. The T-HAN and C-HAN models with DSE + DSR and augmented words reach WRAs of 82.13% and 84.07%, respectively, with C-HAN giving the highest performance among all proposed models.
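The reported figures are easier to compare as absolute gains over each model's no-DSE baseline. The short sketch below is illustrative only and is not part of the study: it tabulates the WRA values quoted in the abstract and subtracts them. The names `WRA`, `gains`, and `best_model` are hypothetical. It also assumes that the augmented-word figures for Transformer and Conformer come from the same DSE + DSR configuration, which the abstract states explicitly only for T-HAN and C-HAN.

```python
# WRA values (%) copied from the abstract, in the order:
# (baseline without DSE, DSE + DSR on isolated words, augmented words).
# Assumption: the augmented figures for Transformer and Conformer use the
# DSE + DSR setup; the abstract says so explicitly only for T-HAN and C-HAN.
WRA = {
    "Transformer": (68.60, 73.40, 76.47),
    "Conformer": (69.87, 74.33, 79.20),
    "T-HAN": (71.07, 75.73, 82.13),
    "C-HAN": (73.00, 76.87, 84.07),
}


def gains(wra):
    """Return absolute WRA gains (percentage points) over each baseline."""
    result = {}
    for model, (baseline, dse, augmented) in wra.items():
        result[model] = {
            "dse_gain": round(dse - baseline, 2),
            "augmented_gain": round(augmented - baseline, 2),
        }
    return result


def best_model(wra):
    """Return the model with the highest WRA in the augmented setting."""
    return max(wra, key=lambda model: wra[model][2])


if __name__ == "__main__":
    for model, g in gains(WRA).items():
        print(f"{model}: +{g['dse_gain']:.2f} with DSE, "
              f"+{g['augmented_gain']:.2f} with augmented words")
    print("Best:", best_model(WRA))
```

These are differences in percentage points, not relative improvements, and they say nothing about statistical significance.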

Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning
