Abstract:Dysarthria, a motor speech disorder that impacts articulation and speech clarity, presents significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes a groundbreaking approach to enhance the accuracy of Dysarthric Speech Recognition (DSR). A primary innovation lies in the integration of the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), an advanced generative adversarial network tailored for Dysarthric Speech Enhancement (DSE), as a front-end processing stage for DSR systems. The S-SEGAN integrates SEGAN's adversarial learning with SepFormer speech separation capabilities, demonstrating significant improvements in performance. Furthermore, a multistage transfer learning approach is employed to assess the DSR models for both word-level and sentence-level DSR. These DSR models are first trained on a large speech dataset (LibriSpeech) and then fine-tuned on dysarthric speech data (both isolated and augmented). Evaluations demonstrate significant DSR accuracy improvements in DSE integration. The Dysarthric Speech (DS)-baseline models (without DSE), Transformer and Conformer achieved Word Recognition Accuracy (WRA) percentages of 68.60% and 69.87%, respectively. The introduction of Hierarchical Attention Network (HAN) with the Transformer and Conformer architectures resulted in improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. The Transformer model with DSE + DSR for isolated words achieves a WRA of 73.40%, while that of the Conformer model reaches 74.33%. Notably, the T-HAN and C-HAN models with DSE + DSR demonstrate even more substantial enhancements, with WRAs of 75.73% and 76.87%, respectively. Augmenting words further boosts model performance, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. Remarkably, the T-HAN and C-HAN models with DSE + DSR and augmented words exhibit WRAs of 82.13% and 84.07%, respectively, with C-HAN displaying the highest performance among all proposed models.

Distributed Submodular Maximization for Large Vocabulary Continuous Speech Recognition

Integrating Lattice-Free MMI into End-to-End Speech Recognition

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

Advancing Multi-talker ASR Performance with Large Language Models

Parallel And Hierarchical Decision Making For Sparse Coding In Speech Recognition

Multi-Stream Posterior Features and Combining Subspace Gmms for Low Resource Lvcsr

DCIM-AVSR : Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Master-ASR: Achieving Multilingual Scalability and Low-Resource Adaptation in ASR with Modular Learning

Training data selection for acoustic modeling via submodular optimization of joint kullback-leibler divergence

Semi-continuous Segmental Probability Modeling for Continuous Speech Recognition.

Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning

SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

A Cluster-Based Multiple Deep Neural Networks Method for Large Vocabulary Continuous Speech Recognition

Distributed Training of Deep Neural Network Acoustic Models for Automatic Speech Recognition: A comparison of current training strategies

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

ON MODULAR TRAINING OF NEURAL ACOUSTICS-TO-WORD MODEL FOR LVCSR