Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter

Yu Xi, Wen Ding, Kai Yu, Junjie Lai
2024-09-20
Abstract: The code-switching (CS) phenomenon occurs when words or phrases from different languages are alternated in a single sentence. Due to data scarcity, building an effective CS Automatic Speech Recognition (ASR) system remains challenging. In this paper, we propose to enhance CS-ASR systems by utilizing rich unsupervised monolingual speech data within a semi-supervised learning framework, particularly when access to CS data is limited. To achieve this, we establish a general paradigm for applying noisy student training (NST) to the CS-ASR task. Specifically, we introduce the LLM-Filter, which leverages well-designed prompt templates to activate the correction capability of large language models (LLMs) for monolingual data selection and pseudo-label refinement during NST. Our experiments on the supervised ASRU-CS and unsupervised AISHELL-2 and LibriSpeech datasets show that our method not only achieves significant improvements over supervised and semi-supervised learning baselines for the CS task, but also attains better performance than the fully-supervised oracle upper bound on the CS English part. Additionally, we investigate the influence of accent on the AESRC dataset and demonstrate that our method can achieve additional benefits when the monolingual data contains relevant linguistic characteristics.
Audio and Speech Processing
### What problem does this paper attempt to solve?

This paper addresses the **data scarcity problem in code-switching (CS) automatic speech recognition (ASR) systems**. Code-switching is the alternation of words or phrases from different languages within a single sentence. Because CS training data is scarce, building an effective CS-ASR system remains challenging. The authors propose to improve CS-ASR within a semi-supervised learning framework by using abundant unlabeled monolingual speech data.

### Main problems and solutions

1. **Data scarcity**:
   - **Problem**: Code-switching data is scarce, and collecting large amounts of natural mixed-language speech is difficult.
   - **Solution**: Use large amounts of unlabeled monolingual data (such as Mandarin and English) to improve the CS-ASR system through a semi-supervised learning (SSL) framework. In particular, the authors apply **noisy student training (NST)** and introduce the **LLM-Filter**.
2. **Pseudo-label quality**:
   - **Problem**: Pseudo-labels generated directly from unlabeled data are often of low quality, which can introduce noise and degrade model performance.
   - **Solution**: The LLM-Filter uses a large language model (LLM) to select monolingual data and refine pseudo-labels. It relies on purpose-built prompt templates to activate the LLM's correction ability, which improves pseudo-label quality.
3. **Cross-language adaptability**:
   - **Problem**: Existing methods handle cross-language code-switching poorly, especially under varied acoustic conditions (such as accented speech).
   - **Solution**: Select high-quality pseudo-labels with the LLM-Filter and train iteratively with NST, so that the model adapts better to switches between languages, particularly on accented data.

### Experimental results

Experiments use ASRU-CS as the supervised CS data and AISHELL-2 and LibriSpeech as the unsupervised monolingual data. The method improves significantly over both the supervised and the semi-supervised baselines on the CS task, and on the English part of the CS data it outperforms the fully supervised oracle upper bound. In addition, when the monolingual data shares relevant linguistic characteristics with the target data (such as the accented English in AESRC), the method yields further gains.

### Summary

By combining the LLM-Filter with NST, this paper mitigates the data scarcity problem in CS-ASR and significantly improves model performance. The experiments cover Mandarin-English code-switching; the approach may also extend to other language pairs.
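
To make the NST loop with LLM-based filtering described in this summary more concrete, here is a minimal Python sketch of one training round. It is an illustration under stated assumptions, not the authors' implementation: the prompt text, the names `llm_filter`, `nst_round`, and `toy_llm`, and the selection rule (keep an utterance only if the LLM changes at most a fraction `max_change_rate` of its tokens) are all hypothetical. The summary does not specify the actual templates or selection criterion.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical prompt; the paper's real templates are not reproduced here.
PROMPT_TEMPLATE = (
    "The text below is an ASR transcript of monolingual {language} speech.\n"
    "Fix any recognition errors and reply with the corrected text only.\n"
    "Transcript: {hypothesis}"
)


@dataclass
class PseudoLabel:
    utt_id: str
    text: str


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def llm_filter(
    hyps: Sequence[Tuple[str, str]],
    language: str,
    llm: Callable[[str], str],
    max_change_rate: float = 0.2,  # assumed threshold, not from the paper
) -> List[PseudoLabel]:
    """Refine teacher hypotheses with an LLM and keep the lightly edited ones.

    Assumption: a hypothesis that needs heavy correction is treated as
    unreliable and dropped; otherwise the corrected text is the pseudo-label.
    Tokens are split on whitespace, which is itself a simplification.
    """
    kept = []
    for utt_id, hyp in hyps:
        prompt = PROMPT_TEMPLATE.format(language=language, hypothesis=hyp)
        corrected = llm(prompt).strip()
        hyp_tokens = hyp.split()
        if not hyp_tokens or not corrected:
            continue
        rate = edit_distance(hyp_tokens, corrected.split()) / len(hyp_tokens)
        if rate <= max_change_rate:
            kept.append(PseudoLabel(utt_id, corrected))
    return kept


def nst_round(unlabeled_ids, transcribe, train, labeled, language, llm):
    """One NST round: teacher labels, LLM filters, student trains.

    `transcribe` (teacher inference) and `train` (student training, where
    noise such as data augmentation would be applied) are caller-supplied.
    """
    hyps = [(utt_id, transcribe(utt_id)) for utt_id in unlabeled_ids]
    selected = llm_filter(hyps, language, llm)
    return train(list(labeled) + [(p.utt_id, p.text) for p in selected])


def toy_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: fixes one known misspelling."""
    hypothesis = prompt.rsplit("Transcript: ", 1)[1]
    return hypothesis.replace("helo", "hello")
```

In a real setup, `llm` would wrap an actual LLM client, and `nst_round` would be called repeatedly, with each trained student serving as the teacher for the next round.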