Abstract: Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing best practices in pretraining has, therefore, become a major focus of NLP research, especially since insights gained from monolingual English models may not necessarily apply to more complex multilingual models. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pretraining objective under the right conditions. We make our code, data, and model weights available at https://github.com/Helsinki-NLP/lm-vs-mt.

What problem does this paper attempt to address?

This paper addresses the problem of comparing multilingual pre-trained models across different pre-training objectives. Specifically, the authors focus on two research questions:

1. **Can the explicit cross-lingual training signal of the translation objective improve downstream performance on monolingual tasks?**
2. **Is the choice of the optimal architecture independent of the training objective?**

### Background and Motivation

In recent years, pre-trained language models (PLMs) have achieved remarkable results in natural language processing (NLP) and attracted considerable research attention. However, different works often use different parameter counts, training data, and evaluation methods, which makes their models difficult to compare. Establishing best practices for pre-training has therefore become an important focus of NLP research.

### Research Methods

To compare objectives under strictly comparable conditions, the authors designed the following experiments:

- **Model Architecture**: Two types of model architectures were considered:
  - Double-stack models (encoder-decoder structure), such as BART.
  - Single-stack models (encoder-only or decoder-only structure), such as BERT and GPT.
- **Pre-training Objectives**:
  - Language modeling (LM) objectives: autoregressive causal language modeling (CLM) and masked language modeling (MLM).
  - Translation (MT) objective: the model is trained to translate between languages, which provides the explicit cross-lingual signal.

### Experimental Setup

- **Dataset**: Two datasets, UNPC and OpenSubtitles, were used, covering six languages (Arabic, Chinese, English, French, Russian, and Spanish).
- **Training Conditions**: All models are trained under the same conditions, including the same tokenization strategy, number of layers, hidden dimension, number of attention heads, and feed-forward dimension.
- **Evaluation Tasks**: Sentiment analysis (SA), named entity recognition (NER), part-of-speech tagging (POS), and natural language inference (NLI).

### Main Findings

- **Double-stack Model**:
  - The BART model pre-trained with the translation objective (MT) outperforms the model pre-trained with the denoising objective (LM) on all tasks.
- **Single-stack Model**:
  - In the probing experiments, the causal language model (CLM) usually performs best, but on some tasks (such as NLI and Arabic NER) the masked language model (MLM) performs better.
  - In the fine-tuning experiments, MLM performs best on most tasks, while the translation language model (TLM) performs very well on sentiment analysis (SA).

### Conclusions

- **Effectiveness of the Translation Objective**: Under the right conditions, translation can be a very effective pre-training objective, especially for the double-stack model.
- **Relationship between Architecture and Pre-training Objective**: The optimal pre-training objective depends on the model's architecture.
- **Future Work**: The authors recommend exploring multi-task learning that combines translation, denoising, and language modeling objectives to improve robustness and generality.

Through a strictly controlled experimental design, this study provides insights for optimizing multilingual pre-training.
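The objectives compared in this summary differ mainly in how one piece of text (or one sentence pair) is turned into model inputs and prediction targets. The sketch below illustrates that difference on pre-tokenized toy sentences. It is not the authors' implementation: the function names, the special tokens `<s>`, `</s>`, and `<mask>`, the 15% masking rate, and the use of simple token masking as the denoising corruption are all assumptions made for illustration.

```python
import random

# Hypothetical special tokens, chosen for illustration only.
BOS, EOS, MASK = "<s>", "</s>", "<mask>"


def clm_example(tokens):
    """Causal LM (single stack): each position predicts the next token."""
    return [BOS] + tokens, tokens + [EOS]


def mlm_example(tokens, mask_prob=0.15, rng=None):
    """Masked LM (single stack): hide some tokens; only hidden positions carry a target."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    inputs, targets = [], []
    for tok in tokens:
        masked = rng.random() < mask_prob
        inputs.append(MASK if masked else tok)
        targets.append(tok if masked else None)  # None = position ignored by the loss
    return inputs, targets


def denoising_example(tokens, mask_prob=0.15, rng=None):
    """Denoising (double stack, LM): encode corrupted text, decode the original.

    Assumption: token masking stands in for whatever corruption is actually used.
    """
    corrupted, _ = mlm_example(tokens, mask_prob, rng)
    return {"encoder_in": corrupted, "decoder_in": [BOS] + tokens, "labels": tokens + [EOS]}


def mt_example(src_tokens, tgt_tokens):
    """Translation (double stack, MT): encode the source, decode the target."""
    return {
        "encoder_in": src_tokens,
        "decoder_in": [BOS] + tgt_tokens,
        "labels": tgt_tokens + [EOS],
    }


if __name__ == "__main__":
    src = "the cat sleeps".split()
    tgt = "le chat dort".split()
    print(clm_example(src))
    print(mlm_example(src, mask_prob=0.5))
    print(denoising_example(src, mask_prob=0.5))
    print(mt_example(src, tgt))
```

In this sketch the denoising and translation examples share the same encoder-decoder layout and differ only in what the encoder reads, which is why the two can be compared on an identical double-stack architecture.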

A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives
