Abstract: Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing best practices in pretraining has, therefore, become a major focus of NLP research, especially since insights gained from monolingual English models may not necessarily apply to more complex multilingual models. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology.
This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pretraining objective under the right conditions. We make our code, data, and model weights available at https://github.com/Helsinki-NLP/lm-vs-mt.
What problem does this paper attempt to address?
This paper addresses the difficulty of comparing multilingual pre-trained models across different pre-training objectives. Specifically, the authors focus on the following two research questions:
1. **Can the explicit cross-lingual training signal of the translation objective improve downstream performance on monolingual tasks?**
2. **Is the choice of the optimal architecture independent of the training objective?**
### Background and Motivation
In recent years, pre-trained language models (PLMs) have achieved remarkable results in natural language processing (NLP) and attracted considerable research attention. However, current research has a significant problem: different works often use different parameter counts, training data, and evaluation methods, which makes their models difficult to compare. Establishing best practices for pre-training has therefore become an important focus of NLP research, particularly because findings from monolingual English models may not carry over to multilingual models.
### Research Methods
To make comparisons under strictly comparable conditions, the authors designed the following experiments:
- **Model Architecture**: Two types of model architectures were considered:
  - Double-stack models (encoder-decoder), such as BART.
  - Single-stack models (encoder-only or decoder-only), such as BERT and GPT.
- **Pre-training Objectives**:
  - Language modeling (LM) objectives: autoregressive causal language modeling (CLM) and masked language modeling (MLM).
  - Translation (MT) objective: the model is trained to translate, which provides an explicit cross-lingual training signal.
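The objectives listed above differ mainly in how a training example is turned into inputs and prediction targets. The following is a minimal sketch of that difference, not the authors' implementation: the function names, the `[MASK]` and `<s>` symbols, and the 15% default masking rate are illustrative assumptions, and the masking scheme is deliberately simplified.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask symbol (assumption)
IGNORE = None    # marks positions that do not contribute to the loss


def clm_example(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Causal LM: at each position, predict the next token."""
    return tokens[:-1], tokens[1:]


def mlm_example(
    tokens: List[str],
    mask_prob: float = 0.15,  # assumed rate, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Masked LM: hide some tokens and predict only the hidden ones."""
    if rng is None:
        rng = random.Random(0)
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(IGNORE)
    return inputs, targets


def mt_example(
    source: List[str], target: List[str], bos: str = "<s>"
) -> Tuple[List[str], List[str], List[str]]:
    """Translation with an encoder-decoder: the encoder reads the source,
    the decoder is fed the target shifted right and predicts the target."""
    decoder_inputs = [bos] + target[:-1]
    return source, decoder_inputs, target
```

Only the translation example pairs text in two languages, which is where the explicit cross-lingual signal comes from; the two language modeling examples are built from a single sentence.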
### Experimental Setup
- **Dataset**: Two datasets, UNPC and OpenSubtitles, were used, covering six languages (Arabic, Chinese, English, French, Russian, and Spanish).
- **Training Conditions**: All models are trained under the same conditions, including the same tokenization strategy, number of network layers, hidden layer dimension, number of attention heads, and feed-forward layer dimension.
- **Evaluation Tasks**: Sentiment analysis (SA), named entity recognition (NER), part-of-speech tagging (POS), and natural language inference (NLI).
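One way to keep such a setup controlled is to record the shared hyperparameters for every model and check that they really match before comparing results. The sketch below illustrates the idea only: the `ModelConfig` class, its field names, and every value in the usage example are hypothetical placeholders, not the settings used in the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ModelConfig:
    name: str        # label for the model, e.g. its objective
    objective: str   # allowed to differ between models
    tokenizer: str   # the remaining fields must be identical
    num_layers: int
    hidden_dim: int
    num_heads: int
    ffn_dim: int


# Fields that the comparison holds fixed across models.
CONTROLLED_FIELDS = ("tokenizer", "num_layers", "hidden_dim", "num_heads", "ffn_dim")


def mismatched_fields(configs: Sequence[ModelConfig]) -> List[str]:
    """Return the controlled fields whose values differ across configs."""
    mismatched = []
    for field_name in CONTROLLED_FIELDS:
        values = {getattr(cfg, field_name) for cfg in configs}
        if len(values) > 1:
            mismatched.append(field_name)
    return mismatched
```

For example, with placeholder values:

```python
# All numbers below are made up for illustration.
clm = ModelConfig("clm", "CLM", "shared-vocab", 12, 512, 8, 2048)
mlm = ModelConfig("mlm", "MLM", "shared-vocab", 12, 512, 8, 2048)
assert mismatched_fields([clm, mlm]) == []
```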
### Main Findings
- **Double-stack Model**:
  - The BART model pre-trained with the translation objective (MT) outperforms the model pre-trained with the denoising objective (LM) on all tasks.
- **Single-stack Model**:
  - In the probing experiments, the causal language model (CLM) usually performs best, but on some tasks (such as NLI and Arabic NER) the masked language model (MLM) performs better.
  - In the fine-tuning experiments, MLM performs best on most tasks, while TLM performs particularly well on the sentiment analysis (SA) task.
### Conclusions
- **Effectiveness of the Translation Objective**: Under the right conditions, translation can be a very effective pre-training objective, especially for the double-stack model.
- **Relationship between Architecture and Pre-training Objective**: The optimal pre-training objective depends on the model's architecture.
- **Future Work**: The authors recommend exploring multi-task learning that combines translation, denoising, and language modeling objectives to improve the model's robustness and generality.
Through its controlled experimental design and comparison, this study provides useful insights for optimizing multilingual pre-trained models.