Investigating Continual Pretraining in Large Language Models: Insights and Implications

Çağatay Yıldız, Nishaanth Kanna Ravichandran, Prishruit Punia, Matthias Bethge, Beyza Ermis
2024-02-27
Abstract: This paper studies the evolving domain of Continual Learning (CL) in large language models (LLMs), with a focus on developing strategies for efficient and sustainable training. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge and enhancing cross-domain knowledge transfer without relying on domain-specific identification. Unlike previous studies, which mostly concentrate on a limited selection of tasks or domains and primarily aim to address the issue of forgetting, our research evaluates the adaptability and capabilities of LLMs to changing data landscapes in practical scenarios. To this end, we introduce a new benchmark designed to measure the adaptability of LLMs to these evolving data environments, offering a comprehensive framework for evaluation. We examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models. Our findings uncover several key insights: (i) when the sequence of domains shows semantic similarity, continual pretraining enables LLMs to better specialize in the current domain compared to stand-alone fine-tuning, (ii) training across a diverse range of domains enhances both backward and forward knowledge transfer, and (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both forgetting and learning. We posit that our research marks a shift towards establishing a more realistic benchmark for investigating CL in LLMs, and has the potential to play a key role in guiding the direction of future research in the field.
Computation and Language
What problem does this paper attempt to address?
The problem this paper addresses is how to continually pretrain large language models (LLMs) so that they adapt and transfer knowledge in an ever-changing data environment, without depending on domain-specific identifiers and without catastrophic forgetting. Specifically, the research focuses on continual domain-adaptive pretraining strategies that let LLMs integrate new information without retraining, retain the knowledge already learned, and enhance cross-domain knowledge transfer.

### Main Problems

1. **Adapting to New Data Environments**: How can LLMs adapt to an ever-changing data environment, especially data from new domains?
2. **Avoiding Catastrophic Forgetting**: How can the model be prevented from forgetting previously learned knowledge while it is trained on new data?
3. **Cross-Domain Knowledge Transfer**: How can knowledge transfer across domains be improved, especially over long sequences of domains?
4. **The Influence of Model Scale**: How do models of different scales perform in continual pretraining?
5. **The Influence of Domain Similarity**: How does the similarity between domains affect the outcome of continual pretraining?

### Research Methods

- **Dataset**: Use the Massively Multi-Domain Dataset (M2D2), which contains 236 hierarchically organized domains drawn from Wikipedia and Semantic Scholar.
- **Model**: Evaluate pretrained LLMs of different architectures and scales, including the GPT2 series (small, medium, large, extra-large) and the RoBERTa series (base, large).
- **Training Order**: Compare training on domains ordered by similarity (similar-order) with training on domains in random order (random-order).
- **Evaluation Metrics**:
  - **Zero-Shot Performance (Zero-Shot Perplexity, ZS)**: The perplexity of the original model without any domain-specific tuning.
  - **Fine-Tuned Performance (Fine-Tuned Perplexity, FT)**: The perplexity of the model after it is fine-tuned on each domain separately.
  - **Continual Pretraining Performance (Continual Pretraining Perplexity, CPT)**: The perplexity of the model on the most recently trained domain.
  - **Last Checkpoint Performance (Last Checkpoint Perplexity, LC)**: The perplexity of the final model on all trained domains.
  - **Forward Transfer**: The performance of the model on unseen future domains.
  - **Backward Transfer**: The performance of the model on previously seen domains.

### Main Findings

1. **Continual Pretraining Is Superior to Individual Fine-Tuning**: When consecutive domains are semantically similar, continual pretraining specializes the model in the current domain better than stand-alone fine-tuning does.
2. **Diversity Training Enhances Knowledge Transfer**: Training across a diverse range of domains enhances both forward and backward knowledge transfer.
3. **Small Models Are More Sensitive**: Smaller models show the highest rates of both forgetting and learning in continual pretraining.
4. **Random Order Is Superior to Similar Order**: With random ordering, the model reaches a lower average perplexity and better backward transfer.
5. **Later Training Is More Prone to Forgetting**: In the later stages of continual learning, the model is more likely to forget knowledge learned in the early stages.
6. **Model Scale Has a Significant Influence**: Larger models perform better in continual pretraining, but this advantage is partly attributable to their better zero-shot performance.
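
To make the evaluation metrics under Research Methods more concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes perplexity is computed as the exponential of the mean negative log-likelihood over tokens, and that results are stored in a hypothetical matrix `ppl[i][j]` holding the perplexity on domain `j` of the checkpoint saved after training on domain `i`. The function names, the matrix layout, and the simple averaging used for backward and forward transfer are all assumptions made for illustration; the paper may aggregate these quantities differently.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood.

    token_log_probs: natural-log probabilities the model assigned to each token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean(values):
    return sum(values) / len(values)


def summarize(ppl):
    """Aggregate a square perplexity matrix into summary metrics.

    ppl[i][j] (hypothetical layout): perplexity on domain j, measured with the
    checkpoint saved after continual pretraining on domains 0..i in order.
    """
    n = len(ppl)
    if n < 2 or any(len(row) != n for row in ppl):
        raise ValueError("expected a square matrix with at least two domains")
    return {
        # CPT: each checkpoint evaluated on the domain it was just trained on.
        "cpt": mean([ppl[i][i] for i in range(n)]),
        # LC: the final checkpoint evaluated on every trained domain.
        "last_checkpoint": mean(ppl[n - 1]),
        # Backward transfer (assumed aggregation): checkpoints on earlier domains.
        "backward": mean([ppl[i][j] for i in range(n) for j in range(i)]),
        # Forward transfer (assumed aggregation): checkpoints on later domains.
        "forward": mean([ppl[i][j] for i in range(n) for j in range(i + 1, n)]),
    }


# Toy numbers for three domains, invented purely for illustration.
toy_ppl = [
    [10.0, 30.0, 40.0],
    [12.0, 9.0, 35.0],
    [15.0, 11.0, 8.0],
]
print(summarize(toy_ppl))
```

Lower values are better for every entry, since all of them are perplexities. Comparing the `backward` value against the zero-shot perplexities on the same domains would indicate whether earlier knowledge was retained or forgotten.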
### Conclusion

Through extensive experiments and evaluations, this research reveals the potential and challenges of continual pretraining in large language models, providing important references and guidance for future continual learning research.