Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produce statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
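The three integration strategies a) to c) can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the function names, the log-linear interpolation with weight `lam`, and the N-best rescoring with weight `weight` are all assumptions made for illustration, since the abstract does not specify how scores are combined.

```python
import math


def fuse_features(acoustic, ssl):
    """(a) Input feature fusion: concatenate two feature streams frame by frame.

    Assumes both streams are already aligned to the same frame rate.
    """
    if len(acoustic) != len(ssl):
        raise ValueError("feature streams must have the same number of frames")
    return [a + s for a, s in zip(acoustic, ssl)]


def log_softmax(scores):
    """Numerically stable log-softmax over one frame's scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(x - m) for x in scores))
    return [x - log_z for x in scores]


def joint_frame_scores(logp_a, logp_b, lam=0.5):
    """(b) Frame-level joint decoding, sketched as log-linear interpolation.

    logp_a / logp_b: per-frame log posteriors from two separately trained
    systems. The interpolation weight `lam` is a hypothetical parameter.
    """
    combined = []
    for frame_a, frame_b in zip(logp_a, logp_b):
        mixed = [(1 - lam) * x + lam * y for x, y in zip(frame_a, frame_b)]
        combined.append(log_softmax(mixed))  # renormalise each frame
    return combined


def rescore(nbest, second_pass_score, weight=0.5):
    """(c) Multi-pass decoding, sketched as N-best rescoring.

    nbest: list of (hypothesis, first_pass_log_score) pairs.
    second_pass_score: callable returning a log score for a hypothesis,
    standing in for a fine-tuned pre-trained ASR model (assumption).
    """
    best = max(
        nbest,
        key=lambda h: (1 - weight) * h[1] + weight * second_pass_score(h[0]),
    )
    return best[0]


# Toy usage with made-up numbers.
fused = fuse_features([[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
joint = joint_frame_scores([log_softmax([2.0, 0.0])], [log_softmax([0.0, 2.0])])
second_pass = {"a b": -3.0, "a c": -0.5}
picked = rescore([("a b", -1.0), ("a c", -1.2)], second_pass.get)
```

In this toy run each fused frame has 2 + 3 = 5 dimensions, the two systems' opposing frame scores cancel to an even posterior, and the second pass overturns the first-pass ranking of the two hypotheses.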


Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition
