Abstract: How to establish a closer relationship between pre-training and the downstream task is a valuable question. We argue that task-adaptive pretraining should not only be performed before the task. For the word alignment task, we propose an iterative self-supervised task-adaptive pretraining paradigm that ties together word alignment and self-supervised pretraining through code-switching data augmentation. Once we obtain the aligned pairs predicted from multilingual contextualized word embeddings, we use these pairs and the original parallel sentences to synthesize code-switched sentences. The multilingual models are then further finetuned on the augmented code-switched dataset. Finally, the finetuned models are used to produce new aligned pairs. This process is executed iteratively. Our paradigm is suitable for almost all unsupervised word alignment methods based on multilingual pre-trained LMs and does not need gold labeled data, extra parallel data, or any other external resources. Experimental results on six language pairs demonstrate that our paradigm consistently improves the baseline methods. Compared to resource-rich languages, the improvements on relatively low-resource or morphologically different languages are more significant. For example, the AER scores of three different alignment methods based on XLM-R are reduced by about $4 \sim 5$ percentage points on the language pair En-Hi.
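The code-switching step described in the abstract can be illustrated with a minimal sketch. It assumes that alignments are given as (source index, target index) pairs and that a fixed fraction of aligned source tokens is swapped for their target counterparts. The function name `code_switch`, the `ratio` parameter, and the rule of keeping only the lowest target index per source token are hypothetical choices for illustration, not details taken from the paper.

```python
import random


def code_switch(src_tokens, tgt_tokens, alignment, ratio=0.5, seed=0):
    """Synthesize a code-switched sentence from one parallel sentence pair.

    alignment: iterable of (src_index, tgt_index) pairs predicted by an aligner.
    ratio:     probability of replacing each aligned source token (assumed knob).
    """
    rng = random.Random(seed)

    # Assumption: when a source token aligns to several target tokens,
    # keep only the lowest target index.
    src_to_tgt = {}
    for i, j in sorted(alignment):
        src_to_tgt.setdefault(i, j)

    switched = list(src_tokens)
    for i, j in src_to_tgt.items():
        if rng.random() < ratio:
            switched[i] = tgt_tokens[j]
    return switched


src = ["the", "cat", "sleeps"]
tgt = ["die", "Katze", "schläft"]
print(code_switch(src, tgt, {(1, 1)}, ratio=1.0))
```

In the iterative paradigm, sentences produced this way would be added to the finetuning data, the model would be finetuned on them, and the aligner would be rerun to obtain new pairs for the next round. The number of rounds and the finetuning objective are left out of this sketch.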

Filtering Training Corpus and Improving Word Alignment for Statistical Machine Translation

Improve The Statistical Machine Translation Performance By Refining The Word Alignments

Improving domain-specific word alignment for computer assisted translation

Improving Function Word Alignment with Frequency and Syntactic Information.

Improving Statistical Word Alignment with Various Clues.

Iterative Task-adaptive Pretraining for Unsupervised Word Alignment

Improving Domain-Specific Word Alignment with a General Bilingual Corpus

Improving statistical word alignment with a rule-based machine translation system

Combining Multiple Alignments to Improve Machine Translation.

Word alignment for languages with scarce resources using bilingual corpora of other language pairs

Translation-Based Automatic Alignment of English and Chinese Parallel Corpora

Word Alignment by Fine-tuning Embeddings on Parallel Corpora

TsinghuaAligner: A Statistical Bilingual Word Alignment System

Towards Better Word Alignment in Transformer.

Research of English-Chinese Alignment at Word Granularity on Parallel Corpora

Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages.

A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

Weighted Alignment Matrices for Statistical Machine Translation.

A Research on Bilingual Dictionary Based Sentence Alignment for Chinese English Parallel Corpus

Search for Discriminative Word Alignment via Dual Decomposition

Transformation from Discontinuous to Continuous Word Alignment Improves Translation Quality