How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

Shih-Heng Wang, Zih-Ching Chen, Jiatong Shi, Ming-To Chuang, Guan-Ting Lin, Kuan-Po Huang, David Harwath, Shang-Wen Li, Hung-yi Lee

2024-11-27

Abstract: Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter a domain mismatch between the pre-training languages and the low-resource target languages. Typical solutions each have a drawback: fine-tuning the SSL model suffers from high computation costs, while using a frozen SSL model as a feature extractor comes with poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on the adapter. We add an extra intermediate adaptation to warm up the initialization of the adapter and the downstream model. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning. It achieves up to a 28% relative improvement in character/phoneme error rate when adapting to unseen languages.

Sound, Computation and Language, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the domain-mismatch problem that self-supervised learning (SSL) models face when adapting to unseen languages in low-resource scenarios. Most existing SSL models are pre-trained on high-resource languages (such as English), so when they are applied to automatic speech recognition (ASR) in low-resource languages, the gap between the pre-training and target languages often degrades performance. The two typical solutions each have a drawback: fine-tuning the SSL model is computationally expensive, while using a frozen SSL model as a feature extractor performs poorly. To address this, the paper extends adapter-based efficient fine-tuning with an additional intermediate adaptation (IA) step that warms up the initialization of the adapters and the downstream model. The method adapts to unseen languages while updating only 1-5% of the total model parameters. On the ML-SUPERB dataset, it achieves up to a 28% relative improvement in character/phoneme error rate over conventional efficient fine-tuning.
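To make the two headline figures concrete, the following is a minimal sketch of the arithmetic behind "1-5% of the total model parameters" and "28% relative improvement". It assumes a standard bottleneck adapter (a down-projection followed by an up-projection, each with a bias) inserted once per transformer layer. The function names and every number used here (backbone size, layer count, hidden and bottleneck dimensions, downstream model size, error rates) are hypothetical illustrations, not values taken from the paper.

```python
def adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameter count of one bottleneck adapter (assumed layout)."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # down-projection weights + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # up-projection weights + bias
    return down + up


def trainable_fraction(backbone_params: int, num_layers: int, hidden_dim: int,
                       bottleneck_dim: int, downstream_params: int) -> float:
    """Share of parameters updated when the SSL backbone stays frozen."""
    adapters = num_layers * adapter_params(hidden_dim, bottleneck_dim)
    trainable = adapters + downstream_params  # only adapters + downstream model are updated
    total = backbone_params + trainable
    return trainable / total


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Relative reduction in an error rate such as CER or PER."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    frac = trainable_fraction(
        backbone_params=95_000_000,
        num_layers=12,
        hidden_dim=768,
        bottleneck_dim=64,
        downstream_params=2_000_000,
    )
    print(f"trainable fraction: {frac:.1%}")

    # Hypothetical error rates: a drop from 25.0 to 18.0 is a 28% relative gain.
    print(f"relative improvement: {relative_improvement(25.0, 18.0):.0%}")
```

Under these assumed sizes the trainable share lands inside the 1-5% range the paper reports. A relative improvement is measured against the baseline error rate, so a 28% relative gain does not mean the error rate drops by 28 percentage points.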
