Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
2024-09-05
Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to continually pre-train large language models (LLMs) so that they can be updated as new data arrives without the cost of re-training from scratch. The current practice is usually to re-train the model from scratch on all data whenever new data becomes available, which consumes a large amount of compute. Simply continuing training is not enough either, because the distribution shift introduced by new data typically degrades performance on the old data or leads to poor adaptation to the new data. To address this, the paper combines three techniques: learning rate (LR) re-warming, LR re-decaying, and replay of previous data. Together these let the model adapt to new data while preserving performance on old data, at a fraction of the compute of re-training. The paper tests the combination under a weak English→English shift and a stronger English→German shift at the 405M parameter scale, and under the weak shift at the 10B parameter scale. In these experiments, the continually pre-trained models match the final loss and average benchmark scores of models re-trained from random initialization on all available data. The paper also proposes infinite learning rate schedules as alternatives to the cosine schedule; they are not bound to a fixed token budget and help circumvent the forgetting induced by LR re-warming. Overall, the main contribution is a set of simple and scalable techniques for updating LLMs on new data at a much lower cost than re-training.
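The three ingredients named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the linear shape of the re-warming phase, and every numeric value (step counts, learning rates, replay fraction) are hypothetical choices made for illustration only.

```python
import math
import random


def rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate for one continual pre-training stage.

    Assumed shape (illustrative): linear re-warming from min_lr to max_lr
    over warmup_steps, then cosine re-decay back to min_lr by total_steps.
    """
    if step < warmup_steps:
        # Re-warming: ramp the LR back up when training on new data begins.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine anneal from max_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sample_batch_sources(batch_size, replay_fraction, rng):
    """Pick, per example, whether it comes from the old or the new dataset.

    Replay: each example is drawn from the previous data with probability
    replay_fraction, otherwise from the new data.
    """
    return ["old" if rng.random() < replay_fraction else "new"
            for _ in range(batch_size)]


if __name__ == "__main__":
    # Hypothetical settings, chosen only to make the sketch concrete.
    rng = random.Random(0)
    for step in (0, 50, 100, 550, 1000):
        lr = rewarmed_cosine_lr(step, total_steps=1000, warmup_steps=100,
                                max_lr=3e-4, min_lr=3e-5)
        print(step, lr)
    print(sample_batch_sources(batch_size=8, replay_fraction=0.05, rng=rng))
```

In a real training loop, the schedule would be restarted at the beginning of each new dataset, and the "old"/"new" labels would select which data loader supplies each example. How large the replay fraction should be is a tuning decision that this sketch does not settle.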