Abstract:The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while operating with real-time factors speeding up to 33.6 times against xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies including SSL pre-trained systems on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in testtime adaptation. T-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms.

On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Linguistic Feedback Supports Rapid Adaptation to Acoustically Degraded Speech

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Linguistic-Coupled Age-to-Age Voice Translation to Improve Speech Recognition Performance in Real Environments

Agmma: A Novel Incremental Adaptation Method And Its Application To Speaker Recognition

Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation

Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition

Use of Speech Impairment Severity for Dysarthric Speech Recognition

Online Speaker Adaptation Using Memory-Aware Networks for Speech Recognition

Online Speaker Adaptation for LVCSR Based on Attention Mechanism

Direct adaptation of hybrid DNN/HMM model for fast speaker adaptation in LVCSR based on speaker code

Rapid Speaker Adaptation Based on D-code Extracted from BLSTM-RNN in LVCSR

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition

Phoneme-Aware Adaptation with Discrepancy Minimization and Dynamically-Classified Vector for Text-independent Speaker Verification

Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition