Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, Ming Zhou
2024-10-02
Abstract: In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In this paper, we explore an alternative approach to constructing an LLM for a new language by continually pretraining (CPT) from existing pretrained LLMs, instead of using randomly initialized parameters. Based on parallel experiments on 40 model sizes ranging from 40M to 5B parameters, we find that 1) CPT converges faster and saves significant resources in a scalable manner; 2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022) with a joint data-parameter scaling term; 3) The compute-optimal data-parameter allocation for CPT markedly differs based on our estimated scaling factors; 4) The effectiveness of transfer at scale is influenced by training duration and linguistic properties, while robust to data replaying, a method that effectively mitigates catastrophic forgetting in CPT. We hope our findings provide deeper insights into the transferability of LLMs at scale for the research community.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the resource cost of building large language models (LLMs) for new languages. Training an LLM from scratch requires substantial computational resources and text data, which are often unavailable in practice. The authors therefore propose Continual Pre-Training (CPT): building an LLM for a new language by starting from an existing pretrained LLM rather than from randomly initialized parameters.

### Main Research Content

1. **Advantages of Continual Pre-Training (CPT)**:
   - Through parallel experiments on 40 model sizes (from 40M to 5B parameters), the authors found that CPT compares favorably with pre-training from scratch in several respects:
     - **Faster Convergence**: CPT models converge faster given the same computational budget.
     - **Resource Savings**: CPT saves significant computational resources across model scales.
     - **Better Scalability**: CPT follows an extended scaling law derived from that of Hoffmann et al. (2022), with an added joint data-parameter scaling term.
     - **Different Data-Parameter Allocation**: The compute-optimal data-parameter allocation for CPT differs from that of models pre-trained from scratch.
2. **Effectiveness of Cross-Language Transfer**:
   - The effectiveness of cross-language transfer is influenced by training duration and linguistic properties, but is robust to data replay (a method that mitigates catastrophic forgetting).
3. **Experimental Setup**:
   - The authors used the same decoder-only Transformer architecture (similar to LLaMA2) throughout and conducted experiments in multiple languages, including English, Chinese, French, and Russian.
   - Experimental data came from publicly available datasets such as RedPajama, Common Crawl, and WuDao.
4. **Evaluation Tasks**:
   - The authors primarily used cross-entropy loss as the performance metric and also evaluated the models on multilingual benchmarks, including XNLI, Multilingual Winograd, and Multilingual HellaSwag.
5. **Scaling Law**:
   - The authors proposed a new scaling law describing the loss-compute relationship for CPT and showed experimentally that it fits CPT better than the Chinchilla law.
6. **Role of Data Replay**:
   - Data replay plays a crucial role in CPT, effectively mitigating catastrophic forgetting, especially when the target language is highly similar to the source language.

### Key Findings

- **CPT consistently shows lower loss during training**, especially in the early stages, where it converges significantly faster.
- **CPT demonstrates advantages at different scales**. In particular, when computational resources are limited, a larger parameter count is more beneficial than more training data.
- **Data replay can effectively prevent catastrophic forgetting**, especially in smaller models, with the optimal replay ratio between 5% and 30%.

### Conclusion

Through systematic experiments, this paper demonstrates the effectiveness of Continual Pre-Training (CPT) for building large language models for new languages. CPT not only reduces resource consumption substantially but, combined with data replay, also mitigates catastrophic forgetting, providing a useful reference for future research and applications.
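
The scaling-law and data-parameter allocation points above can be made concrete with a small numerical sketch. It uses the parametric loss and fitted constants reported by Hoffmann et al. (2022) as the baseline, and the common approximation that training compute is about 6 × parameters × tokens. The summary does not give the exact form or coefficients of the paper's joint data-parameter term, so `joint_term_loss` and its `gamma` exponent are purely illustrative assumptions, as is the helper name `optimal_params`. The sketch only shows how a joint term can shift the compute-optimal split toward more parameters.

```python
import math

# Parametric fit L(N, D) = E + A / N**alpha + B / D**beta
# with the constants reported by Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Baseline loss as a function of parameters N and training tokens D."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def joint_term_loss(n_params, n_tokens, gamma=0.05):
    """Hypothetical extension: the data term also shrinks with model size.

    The multiplicative form and the value of gamma are assumptions made
    for illustration, not the coefficients fitted in the paper.
    """
    return E + A / n_params**ALPHA + B / (n_tokens**BETA * n_params**gamma)


def optimal_params(loss_fn, compute_flops, grid=2000):
    """Grid-search the parameter count that minimizes loss at fixed compute.

    Assumes compute ~= 6 * N * D, so D = compute / (6 * N).
    """
    best_n, best_loss = None, math.inf
    for i in range(grid + 1):
        n = 10 ** (6 + 6 * i / grid)  # log-spaced from 1e6 to 1e12 parameters
        d = compute_flops / (6 * n)
        loss = loss_fn(n, d)
        if loss < best_loss:
            best_n, best_loss = n, loss
    return best_n, best_loss


if __name__ == "__main__":
    budget = 1e21  # FLOPs, an arbitrary example budget
    for name, fn in [("baseline", chinchilla_loss), ("joint term", joint_term_loss)]:
        n_opt, loss = optimal_params(fn, budget)
        print(f"{name}: N_opt ~ {n_opt:.2e}, tokens ~ {budget / (6 * n_opt):.2e}, loss ~ {loss:.3f}")
```

Under these assumed coefficients the joint-term variant places its optimum at a larger parameter count than the baseline for the same budget, which is the qualitative behavior the key findings describe. Fitting real coefficients would require the loss curves from the paper's experiments.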