Abstract: Pretrained models have driven much of the recent progress in natural language processing. However, multilingual sequence-to-sequence pretrained language models (Seq2Seq PLMs) are pretrained on a wide range of languages (e.g., 25 languages) yet are often finetuned for specific bilingual tasks (e.g., English–German). This creates domain and task discrepancies between the pretraining and finetuning stages, which may lead to sub-optimal downstream performance. In this study, we first illustrate these domain and task discrepancies and then investigate in depth the side effects they may have on both training dynamics and downstream performance. To alleviate those side effects, we introduce a simple and effective code-switching restoration task (namely, code-switching finetuning) into the standard pretrain-finetune pipeline. Specifically, in the first stage, we recast the downstream data in the self-supervised format used for pretraining, in which the denoising signal is the code-switched cross-lingual phrase. In the second stage, the model is finetuned on the downstream task as usual. Experiments spanning natural language generation (12 supervised translation, 30 zero-shot translation, and 2 cross-lingual summarization tasks) and natural language understanding (7 cross-lingual natural language inference tasks) demonstrate that our model consistently and significantly surpasses the standard finetuning strategy. Analyses show that our method introduces negligible computational cost and reduces cross-lingual representation gaps. The code is publicly available at: https://github.com/zanchangtong/CSR4mBART
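
As a rough illustration of the first stage described above, the following minimal sketch builds a code-switching restoration pair from a single source sentence. It is not taken from the released code: the function name `code_switch`, the toy `phrase_table`, the `ratio` parameter, and the 3-token phrase limit are all hypothetical choices made for this example, and the actual method may select and replace phrases differently.

```python
import random
from typing import Dict, List, Optional, Tuple


def code_switch(
    tokens: List[str],
    phrase_table: Dict[str, str],
    ratio: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Build one (noised input, restoration target) pair.

    Assumption: `phrase_table` is a hypothetical source-to-target phrase
    dictionary (for instance, derived from word alignments). Matching
    phrases are swapped for their translations with probability `ratio`.
    """
    rng = rng or random.Random(0)
    noised: List[str] = []
    i = 0
    while i < len(tokens):
        replaced = False
        # Prefer the longest matching phrase (up to 3 tokens in this sketch).
        for n in (3, 2, 1):
            span = tokens[i:i + n]
            phrase = " ".join(span)
            if len(span) == n and phrase in phrase_table and rng.random() < ratio:
                noised.extend(phrase_table[phrase].split())
                i += n
                replaced = True
                break
        if not replaced:
            noised.append(tokens[i])
            i += 1
    # The restoration target is the original, un-switched sentence.
    return noised, list(tokens)


if __name__ == "__main__":
    table = {"the cat": "die Katze", "mat": "Matte"}  # toy entries
    src = "the cat sat on the mat".split()
    noised, target = code_switch(src, table, ratio=1.0)
    print(" ".join(noised), "->", " ".join(target))
```

Under these assumptions, stage one would finetune the model to map each noised input back to its target, and stage two would proceed with ordinary finetuning on the downstream pairs.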

Code-switching finetuning: Bridging multilingual pretrained language models for enhanced cross-lingual performance
