How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

Shih-Heng Wang, Zih-Ching Chen, Jiatong Shi, Ming-To Chuang, Guan-Ting Lin, Kuan-Po Huang, David Harwath, Shang-Wen Li, Hung-yi Lee
2024-11-27
Abstract: The utilization of speech Self-Supervised Learning (SSL) models achieves impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors comes with poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on the adapter. We add an extra intermediate adaptation to warm up the adapter and downstream model initialization. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning. It achieves up to a 28% relative improvement in the Character/Phoneme error rate when adapting to unseen languages.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the domain mismatch that speech self-supervised learning (SSL) models encounter when adapting to unseen languages in low-resource scenarios. Most existing SSL models are pre-trained on high-resource languages (such as English), so when they are applied to automatic speech recognition (ASR) in low-resource languages, the gap between the pre-training languages and the target language often leads to poor performance. The two typical solutions each have a drawback: fine-tuning the SSL model incurs a high computational cost, while using the frozen SSL model as a feature extractor performs poorly. To address both, the paper extends adapter-based efficient fine-tuning with an additional intermediate adaptation (IA) step that warms up the initialization of the adapters and the downstream model. The method adapts the model to unseen languages while updating only 1-5% of the total model parameters. Experimental results on the ML-SUPERB dataset show that it outperforms conventional efficient fine-tuning, with a relative improvement of up to 28% in character/phoneme error rate.
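To make the two headline figures concrete, the following is a minimal arithmetic sketch, not the authors' implementation. It assumes a common bottleneck adapter design (a down-projection followed by an up-projection, each with a bias), which the summary above does not specify. The function names and every number used (backbone size, layer count, hidden and bottleneck dimensions, downstream model size, error rates) are hypothetical and chosen only for illustration.

```python
def adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameter count of one assumed bottleneck adapter (weights plus biases)."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # down-projection
    up = bottleneck_dim * hidden_dim + hidden_dim        # up-projection
    return down + up


def trainable_fraction(backbone_params: int, num_layers: int,
                       hidden_dim: int, bottleneck_dim: int,
                       downstream_params: int) -> float:
    """Share of all parameters that are updated when the SSL backbone is frozen."""
    # Only the adapters (one per layer, by assumption) and the downstream
    # model are trained; the backbone contributes to the total only.
    trainable = (num_layers * adapter_params(hidden_dim, bottleneck_dim)
                 + downstream_params)
    total = backbone_params + trainable
    return trainable / total


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Relative error-rate reduction, e.g. 0.28 for a 28% relative improvement."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    frac = trainable_fraction(
        backbone_params=95_000_000,
        num_layers=12,
        hidden_dim=768,
        bottleneck_dim=64,
        downstream_params=1_000_000,
    )
    print(f"trainable share: {frac:.1%}")
    print(f"relative improvement: {relative_improvement(40.0, 28.8):.0%}")
```

With these made-up values the trainable share comes out at roughly 2%, inside the 1-5% range the paper reports. The second function shows how a relative improvement is read: an error rate falling from 40.0 to 28.8 is a 28% relative reduction, even though the absolute drop is 11.2 points.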