Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is hindered by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produced statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
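As a rough illustration of approaches a) and b) above, the following is a minimal sketch, not the paper's implementation. It assumes that input feature fusion is a per-frame concatenation of acoustic and SSL features, and that frame-level joint decoding is a log-linear interpolation of the two systems' per-frame log-probabilities. The abstract does not specify either detail, so the function names, the interpolation weight and the feature dimensions are all hypothetical.

```python
import numpy as np


def fuse_features(acoustic, ssl):
    """Concatenate per-frame acoustic and SSL features (assumed fusion scheme).

    Both inputs are (num_frames, dim) arrays that are already time-aligned.
    """
    if acoustic.shape[0] != ssl.shape[0]:
        raise ValueError("acoustic and SSL features must have the same frame count")
    return np.concatenate([acoustic, ssl], axis=1)


def joint_frame_log_probs(log_probs_a, log_probs_b, weight=0.5):
    """Combine two systems' frame-level log-probabilities (assumed scheme).

    Uses log-linear interpolation with a hypothetical weight, then
    renormalizes each frame so the combined scores form a distribution.
    """
    combined = weight * log_probs_a + (1.0 - weight) * log_probs_b
    # Numerically stable log-sum-exp over the output-unit axis.
    peak = combined.max(axis=1, keepdims=True)
    log_norm = peak + np.log(np.exp(combined - peak).sum(axis=1, keepdims=True))
    return combined - log_norm


def log_softmax(x):
    """Row-wise log-softmax, used here only to make toy system outputs."""
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


rng = np.random.default_rng(0)
acoustic = rng.normal(size=(5, 40))   # e.g. 40-dim filterbank features (assumed)
ssl = rng.normal(size=(5, 768))       # e.g. 768-dim SSL features (assumed)
fused = fuse_features(acoustic, ssl)

system_a = log_softmax(rng.normal(size=(5, 10)))  # toy outputs, 10 units
system_b = log_softmax(rng.normal(size=(5, 10)))
joint = joint_frame_log_probs(system_a, system_b, weight=0.5)
```

In this sketch the fused input has one row per frame and the summed feature width, and each row of `joint` exponentiates to a probability distribution over the output units. In practice the interpolation weight would be tuned on held-out data.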

LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks

SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations

Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models

Efficient infusion of self-supervised representations in Automatic Speech Recognition

Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

Progressive Multi-scale Self-supervised Learning for Speech Recognition

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

Federated Representation Learning for Automatic Speech Recognition

Label Aware Speech Representation Learning For Language Identification

Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations

Semi-Supervised Learning with Data Augmentation for End-to-End ASR

Investigating Self-Supervised Learning for Speech Enhancement and Separation

Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition

CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition

FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation