Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish

2024-09-05

Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses the cost of keeping large language models (LLMs) up to date: how to continually pre-train a model as new data arrives instead of re-training it from scratch. Current practice is usually to re-train from scratch whenever new data becomes available, which consumes a large amount of compute. Simply continuing to train on the new data is not enough, because the distribution shift it introduces typically degrades performance on previous data or leads to poor adaptation to the new data. To address this, the paper combines three techniques: learning rate (LR) re-warming, LR re-decaying, and replay of previous data. Together they let the model adapt to new data while retaining its performance on old data, using only a fraction of the compute that re-training requires. The paper tests the combination on a weak but realistic distribution shift (English to English) and a stronger one (English to German) at the 405M parameter scale with hundreds of billions of tokens, and on the weak shift for a 10B parameter model. In these experiments, the continually pre-trained models match the baseline re-trained from scratch on all available data, as measured by final loss and average score on several LM evaluation benchmarks. The paper also proposes infinite learning rate schedules as alternatives to the cosine schedule; these help circumvent the forgetting induced by LR re-warming and are not bound to a fixed token budget. In short, the main contribution is a set of simple and scalable techniques for updating LLMs on new data at a much lower cost than re-training.
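The three ingredients named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the linear warm-up shape, and every numeric value (learning rates, step counts, the 5% replay fraction) are assumptions chosen only for illustration.

```python
import math
import random


def rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate for one continual pre-training phase (hypothetical helper).

    The schedule restarts at `min_lr`, re-warms linearly to `max_lr`,
    then re-decays along a cosine curve back to `min_lr`.
    """
    if step < warmup_steps:
        # LR re-warming: linear ramp from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # LR re-decaying: cosine decay over the remaining steps of this phase.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sample_batch_with_replay(new_data, old_data, replay_fraction, batch_size, rng):
    """Build one batch that mixes new data with replayed previous data.

    `replay_fraction` is the share of the batch drawn from the previous
    dataset; the value used by a caller is an assumption, not a recommendation.
    """
    n_replay = round(replay_fraction * batch_size)
    # Sample with replacement so small toy datasets also work.
    batch = rng.choices(old_data, k=n_replay)
    batch += rng.choices(new_data, k=batch_size - n_replay)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    # Illustrative values only.
    for step in (0, 50, 100, 5000, 10000):
        lr = rewarmed_cosine_lr(step, total_steps=10000, warmup_steps=100,
                                max_lr=3e-4, min_lr=3e-5)
        print(f"step {step:>5}: lr = {lr:.2e}")

    rng = random.Random(0)
    old = [("old", i) for i in range(100)]
    new = [("new", i) for i in range(100)]
    batch = sample_batch_with_replay(new, old, replay_fraction=0.05,
                                     batch_size=20, rng=rng)
    print(sum(1 for tag, _ in batch if tag == "old"), "replayed examples out of", len(batch))
```

In a training loop, the schedule would be evaluated once per optimizer step at the start of each new dataset, and the replay sampler would replace plain sampling from the new dataset.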

Continual Pre-Training of Large Language Models: How to (re)warm your model?

Continual Learning for Large Language Models: A Survey

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

Efficiently Adapting Pretrained Language Models To New Languages

Scalable Language Model with Generalized Continual Learning

Efficient Continual Pre-training by Mitigating the Stability Gap

Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models

Investigating Continual Pretraining in Large Language Models: Insights and Implications

On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code

Continual Learning Under Language Shift

Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Towards Practical Tool Usage for Continually Learning LLMs

Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora

Large-scale Language Model Rescoring on Long-form Data

Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

Temporal Scaling Law for Large Language Models

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Improving Multimodal Large Language Models Using Continual Learning