Abstract: The cross-domain performance of automatic speech recognition (ASR) can be severely degraded by the mismatch between training and testing distributions. Because the target domain usually lacks labeled data and domain shifts exist at both the acoustic and linguistic levels, unsupervised domain adaptation (UDA) for ASR is challenging. Previous work has shown that self-supervised learning (SSL) and pseudo-labeling (PL) are each effective for UDA because they exploit self-supervision from unlabeled data. However, these forms of self-supervision also degrade under mismatched domain distributions, a problem that previous work fails to address. This work presents a systematic UDA framework that fully utilizes unlabeled data with self-supervision within the pre-training and fine-tuning paradigm. On the one hand, we apply continued pre-training and data replay techniques to mitigate the domain mismatch of the SSL pre-trained model. On the other hand, we propose a domain-adaptive fine-tuning approach based on the PL technique with three modifications: first, we design a dual-branch PL method to decrease sensitivity to erroneous pseudo-labels; second, we devise an uncertainty-aware confidence filtering strategy to improve pseudo-label correctness; third, we introduce a two-step PL approach that incorporates target-domain linguistic knowledge, thus generating more accurate target-domain pseudo-labels. Experimental results on various cross-domain scenarios demonstrate that the proposed approach effectively boosts cross-domain performance and significantly outperforms previous approaches.

Progressive Multi-scale Self-supervised Learning for Speech Recognition

Progressive Residual Extraction based Pre-training for Speech Representation Learning

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Improving Multimodal Speech Enhancement by Incorporating Self-Supervised and Curriculum Learning

An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks

SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

Boosting Cross-Domain Speech Recognition with Self-Supervision

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Progressive Multi-Target Network Based Speech Enhancement with SNR-Preselection for Robust Speaker Diarization

Federated Self-Learning with Weak Supervision for Speech Recognition

A Progressive Learning Approach to Adaptive Noise and Speech Estimation for Speech Enhancement and Noisy Speech Recognition

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition

Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Multi-Stage Progressive Speech Enhancement Network

Consistency Based Unsupervised Self-training For ASR Personalisation

Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification