Abstract: We present Unicoder, a universal language encoder that is insensitive to different languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training data in one language and directly applied to inputs of the same task in other languages. Compared to similar efforts such as Multilingual BERT and XLM, three new cross-lingual pre-training tasks are proposed, including cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. These tasks help Unicoder learn the mappings among different languages from more perspectives. We also find that fine-tuning on multiple languages together can bring further improvement. Experiments are performed on two tasks: cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA), where XLM is our baseline. On XNLI, 1.8% averaged accuracy improvement (on 15 languages) is obtained. On XQA, which is a new cross-lingual dataset built by us, 5.5% averaged accuracy improvement (on French and German) is obtained.

What problem does this paper attempt to address?

This paper addresses how to use training data in a single language to improve performance in other languages on cross-lingual natural language processing tasks, especially when no annotated data is available in the target language. Specifically, it proposes a universal language encoder named Unicoder, which aims to improve the model's adaptability to different languages through multiple cross-lingual pre-training tasks, and thereby to perform better on tasks such as cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA).

### Core Problems of the Paper

1. **Challenges in Cross-lingual Tasks**: Most existing pre-trained models, such as BERT and ELMo, perform poorly on cross-lingual tasks because they are mainly trained on data in a single language. When the training and test data are in different languages, the performance of these models declines significantly.
2. **Multi-language Pre-training**: To overcome this problem, the paper proposes a new multi-language pre-training method that strengthens the model's cross-lingual understanding by introducing several cross-lingual tasks:
   - **Cross-lingual Word Recovery**: Learn word alignments between different languages through an attention mechanism.
   - **Cross-lingual Paraphrase Classification**: Determine whether two sentences in different languages have the same meaning.
   - **Cross-lingual Masked Language Model**: Run masked language model training on a document that contains multiple languages.

### Solutions

1. **Multi-task Pre-training**: Unicoder uses the existing masked language model (MLM) and translation language model (TLM) objectives and adds the three new cross-lingual tasks above. These tasks help the model learn the mappings between different languages from several perspectives.
2. **Multi-language Fine-tuning**: The paper proposes a multi-language fine-tuning strategy, in which data in both the source language and the target languages is used together during fine-tuning. This strategy further improves the model's performance on cross-lingual tasks.
3. **Experimental Verification**: The paper runs experiments on two cross-lingual tasks, XNLI and XQA, with XLM as the baseline. Unicoder improves on the baseline on both tasks, and multi-language fine-tuning brings further gains.

### Main Contributions

1. **New Cross-lingual Pre-training Tasks**: Three new cross-lingual pre-training tasks that help learn a better language-independent encoder.
2. **A Cross-lingual Question Answering Dataset**: A new dataset, XQA, that can serve as a new cross-lingual benchmark.
3. **Multi-language Fine-tuning Strategy**: Evidence that fine-tuning on multiple languages together brings further improvement.
4. **New State-of-the-Art Results**: New state-of-the-art results on the XNLI dataset.

### Experimental Results

- **XNLI**: Average accuracy over 15 languages improves by 1.8% over the XLM baseline.
- **XQA**: Average accuracy on French and German improves by 5.5% over the XLM baseline.

With these improvements, Unicoder offers an effective approach to cross-lingual natural language processing tasks.
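
To make two of the pre-training tasks described above more concrete, here is a minimal sketch of how training examples for cross-lingual paraphrase classification and the cross-lingual masked language model might be assembled. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, negatives are drawn at random, the mixed-language document is built by swapping every second sentence for its translation, and the 15% masking rate in the demo is a common BERT-style choice rather than a figure taken from this summary.

```python
import random


def make_paraphrase_pairs(parallel, rng):
    """Build (sentence_a, sentence_b, label) examples from translation pairs.

    A true translation pair gets label 1, a mismatched pair gets label 0.
    Drawing the negative at random is an assumption of this sketch.
    """
    if len(parallel) < 2:
        raise ValueError("need at least two translation pairs to draw a negative")
    examples = []
    for i, (src, tgt) in enumerate(parallel):
        examples.append((src, tgt, 1))
        # Choose the target side of a different pair as the negative.
        j = rng.choice([k for k in range(len(parallel)) if k != i])
        examples.append((src, parallel[j][1], 0))
    return examples


def make_cross_lingual_document(sentences, translations):
    """Return a mixed-language document.

    Sentences at odd positions (0-based) are replaced by their translations.
    The alternation scheme is an assumption of this sketch.
    """
    if len(sentences) != len(translations):
        raise ValueError("sentences and translations must align one to one")
    return [
        translations[i] if i % 2 == 1 else sentences[i]
        for i in range(len(sentences))
    ]


def mask_tokens(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Replace each token with mask_token with probability mask_prob.

    Returns the masked sequence and the prediction targets, where a target
    is the original token at masked positions and None elsewhere.
    """
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


if __name__ == "__main__":
    rng = random.Random(0)
    parallel = [
        ("the cat sleeps", "le chat dort"),
        ("I like tea", "j'aime le thé"),
        ("it is raining", "il pleut"),
    ]
    for example in make_paraphrase_pairs(parallel, rng):
        print(example)

    english = [src for src, _ in parallel]
    french = [tgt for _, tgt in parallel]
    document = make_cross_lingual_document(english, french)
    tokens = " ".join(document).split()
    print(mask_tokens(tokens, 0.15, rng))
```

In a real pipeline the masked tokens would be subword IDs and the classifier input would be the encoded sentence pair; the sketch only shows how the supervision signal for each task can be derived from parallel text.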

Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
