Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and non-aged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions of up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while operating with real-time factor speed-ups of up to 33.6 times over xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies, including SSL pre-trained systems, on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show that VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in test-time adaptation. t-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms.
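
As a rough illustration of the feature-conditioned LHUC idea mentioned in the abstract, the sketch below scales hidden-unit activations by a per-speaker amplitude vector predicted from a speaker feature. It assumes the commonly used LHUC amplitude form 2·sigmoid(r) and a single linear predictor; the function names, the shapes, and the predictor itself are hypothetical and are not taken from the paper, whose actual f-LHUC regression network and VR-SBE feature extraction are not described here.

```python
import numpy as np

def lhuc_amplitude(r):
    """Commonly used LHUC amplitude 2*sigmoid(r); values lie in (0, 2)."""
    return 2.0 / (1.0 + np.exp(-r))

def f_lhuc_adapt(hidden, spk_feat, W, b):
    """Scale hidden activations by amplitudes predicted from a speaker feature.

    hidden:   (frames, units) hidden-layer outputs
    spk_feat: (feat_dim,) speaker-level feature (a stand-in for VR-SBE)
    W, b:     hypothetical linear predictor mapping features to LHUC parameters
    """
    r = spk_feat @ W + b                 # (units,) predicted LHUC parameters
    return hidden * lhuc_amplitude(r)    # broadcast over frames

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 8))     # 5 frames, 8 hidden units
spk_feat = rng.standard_normal(4)        # 4-dimensional speaker feature

# With a zero predictor, r = 0 and every amplitude is 1 (no adaptation).
W0, b0 = np.zeros((4, 8)), np.zeros(8)
unadapted = f_lhuc_adapt(hidden, spk_feat, W0, b0)

# With a non-trivial predictor, each hidden unit is rescaled per speaker.
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)
adapted = f_lhuc_adapt(hidden, spk_feat, W1, b1)
```

Because the amplitudes come from a single forward pass over the speaker feature rather than from per-speaker gradient updates, a scheme of this kind needs no adaptation-time optimization, which is consistent with the on-the-fly framing in the abstract.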

Text Adaptation for Speaker Verification with Speaker-Text Factorized Embeddings

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Phoneme-Aware Adaptation with Discrepancy Minimization and Dynamically-Classified Vector for Text-independent Speaker Verification

Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation

A text-dependent speaker verification application framework based on Chinese numerical string corpus

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

A framework of text-dependent speaker verification for Chinese numerical string corpus

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

The SJTU System for Short-duration Speaker Verification Challenge 2021

On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition

Phoneme-aware and Channel-wise Attentive Learning for Text Dependent Speaker Verification

Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

Adversarial Speaker Verification

A Robust Speaker-Adaptive and Text-Prompted Speaker Verification System

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models

AdaptiveFormer: A Few-shot Speaker Adaptive Speech Synthesis Model Based on FastSpeech2

FA-ExU-Net: the simultaneous training of an embedding extractor and enhancement model for a speaker verification system robust to short noisy utterances

DeltaVLAD: an Efficient Optimization Algorithm to Discriminate Speaker Embedding for Text-Independent Speaker Verification