Abstract: In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In this paper, we explore an alternative approach to constructing an LLM for a new language by continually pretraining (CPT) from existing pretrained LLMs, instead of using randomly initialized parameters. Based on parallel experiments on 40 model sizes ranging from 40M to 5B parameters, we find that 1) CPT converges faster and saves significant resources in a scalable manner; 2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022) with a joint data-parameter scaling term; 3) the compute-optimal data-parameter allocation for CPT markedly differs based on our estimated scaling factors; 4) the effectiveness of transfer at scale is influenced by training duration and linguistic properties, while robust to data replaying, a method that effectively mitigates catastrophic forgetting in CPT. We hope our findings provide deeper insights into the transferability of LLMs at scale for the research community.
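To make points 2) and 3) of the abstract concrete, here is a minimal sketch of a Chinchilla-style loss law with one extra joint data-parameter term. The base form L(N, D) = E + A/N^alpha + B/D^beta and its coefficients are those reported by Hoffmann et al. (2022). The `gamma` exponent, the exact way the joint term enters, and the `CPT_ILLUSTRATIVE` values are assumptions made for illustration, not the paper's fitted form or coefficients. The compute budget uses the common approximation C ≈ 6ND.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingParams:
    E: float            # irreducible loss
    A: float            # parameter-term coefficient
    alpha: float        # parameter-term exponent
    B: float            # data-term coefficient
    beta: float         # data-term exponent
    gamma: float = 0.0  # hypothetical joint exponent; 0 recovers the Chinchilla form


def predicted_loss(p: ScalingParams, n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N**alpha + B / (D**beta * N**gamma)."""
    joint = n_tokens ** p.beta * n_params ** p.gamma
    return p.E + p.A / n_params ** p.alpha + p.B / joint


def compute_optimal_split(p: ScalingParams, flops: float,
                          n_min: float = 1e7, n_max: float = 1e11,
                          steps: int = 400):
    """Grid-search N on a log scale with D fixed by C = 6 * N * D."""
    lo, hi = math.log10(n_min), math.log10(n_max)
    best = None
    for i in range(steps + 1):
        n = 10.0 ** (lo + (hi - lo) * i / steps)
        d = flops / (6.0 * n)
        loss = predicted_loss(p, n, d)
        if best is None or loss < best[2]:
            best = (n, d, loss)
    return best  # (n_params, n_tokens, loss)


# Coefficients reported by Hoffmann et al. (2022) for training from scratch.
CHINCHILLA = ScalingParams(E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28)

# Illustrative values only: NOT the paper's fitted CPT coefficients.
CPT_ILLUSTRATIVE = ScalingParams(E=1.69, A=406.4, alpha=0.34,
                                 B=410.7, beta=0.28, gamma=0.05)

if __name__ == "__main__":
    for name, params in [("scratch", CHINCHILLA), ("cpt (illustrative)", CPT_ILLUSTRATIVE)]:
        n, d, loss = compute_optimal_split(params, flops=1e21)
        print(f"{name}: N={n:.3g}, D={d:.3g}, loss={loss:.4f}")
```

With a positive joint exponent, the data term shrinks as the model grows, so the grid search places more of a fixed budget into parameters than the plain Chinchilla form does. How large that shift is in practice depends entirely on the fitted coefficients, which should be taken from the paper itself.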

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the resource cost of building large language models (LLMs) for new languages. Training an LLM from scratch requires substantial computational resources and text data, which are often unavailable in practice. The authors therefore propose Continual Pre-Training (CPT): building an LLM for a new language by starting from an existing pretrained LLM instead of from randomly initialized parameters.

### Main Research Content

1. **Advantages of Continual Pre-Training (CPT)**:
   - Through parallel experiments across 40 model sizes (from 40M to 5B parameters), the authors found that CPT outperforms pre-training from scratch in several respects:
     - **Faster Convergence**: CPT models converge faster under the same computational budget.
     - **Resource Savings**: CPT saves significant computational resources across model scales.
     - **Better Scalability**: CPT follows an extended scaling law, derived from that of Hoffmann et al. (2022), which includes a joint data-parameter scaling term.
     - **Different Data-Parameter Allocation**: The compute-optimal data-parameter allocation for CPT differs from that of models pre-trained from scratch.
2. **Effectiveness of Cross-Language Transfer**:
   - The effectiveness of cross-language transfer is influenced by training duration and linguistic properties, but is robust to data replay (an effective method for mitigating catastrophic forgetting).
3. **Experimental Setup**:
   - The authors used the same decoder-only Transformer architecture (similar to LLaMA2) throughout and conducted experiments in multiple languages, including English, Chinese, French, and Russian.
   - Experimental data came from publicly available datasets such as RedPajama, Common Crawl, and WuDao.
4. **Evaluation Tasks**:
   - The authors primarily used cross-entropy loss as the performance metric and evaluated the models on various multilingual benchmarks, including XNLI, Multilingual Winograd, and Multilingual Hellaswag.
5. **Scaling Law**:
   - The authors proposed a new scaling law to describe the loss-compute relationship for CPT and showed experimentally that it fits CPT better than the Chinchilla law.
6. **Role of Data Replay**:
   - Data replay plays a crucial role in CPT, effectively mitigating catastrophic forgetting, especially when the target language is highly similar to the source language.

### Key Findings

- **CPT consistently shows lower loss during training**, especially in the early stages, with significantly faster convergence.
- **CPT demonstrates advantages at different scales**, particularly when computational resources are limited, where a larger parameter count is more beneficial than more training data.
- **Data replay can effectively prevent catastrophic forgetting**, especially in smaller models, with the optimal replay ratio being between 5% and 30%.

### Conclusion

Through systematic experiments, this paper demonstrates the effectiveness of Continual Pre-Training (CPT) for building large language models for new languages. CPT reduces resource consumption substantially and, combined with data replay, mitigates catastrophic forgetting, providing a useful reference for future research and applications.
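The data replay idea summarized above can be sketched as a simple mixing step: while training on the target language, a fraction of the examples is drawn from the original source-language corpus. The snippet below is a minimal sketch under stated assumptions, not the authors' pipeline. The function name `replay_mixture`, the document-level sampling, and the toy corpora are all hypothetical; real training pipelines usually mix at the token or batch level.

```python
import random
from typing import List, Sequence


def replay_mixture(
    target_docs: Sequence[str],
    source_docs: Sequence[str],
    replay_ratio: float,
    n_samples: int,
    seed: int = 0,
) -> List[str]:
    """Draw n_samples documents; each comes from the source-language
    corpus with probability replay_ratio, otherwise from the target."""
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be in [0, 1]")
    if not target_docs or not source_docs:
        raise ValueError("both corpora must be non-empty")
    rng = random.Random(seed)  # local RNG so the mixture is reproducible
    mixed = []
    for _ in range(n_samples):
        pool = source_docs if rng.random() < replay_ratio else target_docs
        mixed.append(rng.choice(pool))
    return mixed


if __name__ == "__main__":
    # Toy corpora standing in for source (e.g. English) and target data.
    source = [f"src-{i}" for i in range(100)]
    target = [f"tgt-{i}" for i in range(100)]
    batch = replay_mixture(target, source, replay_ratio=0.2, n_samples=1000)
    share = sum(doc.startswith("src-") for doc in batch) / len(batch)
    print(f"observed replay share: {share:.3f}")
```

A `replay_ratio` of 0.2 sits inside the 5% to 30% range reported in the key findings; the best value for a given language pair and model size would need to be tuned.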

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention

Towards Effective and Efficient Continual Pre-training of Large Language Models

Distributed Training of Large Language Models

A Survey of Large Language Models

The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model

Revealing the Parallel Multilingual Learning within Large Language Models

Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions

Training Bilingual LMs with Data Constraints in the Targeted Language

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Patch-Level Training for Large Language Models

Investigating Continual Pretraining in Large Language Models: Insights and Implications

MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Efficient Continual Pre-training of LLMs for Low-resource Languages