Abstract: Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) continually pre-training the LLM on a properly sized subset for multiple epochs, which yields faster performance recovery than pre-training the LLM on a large corpus in a single epoch; (2) pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.
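The drop-then-recover pattern described in the abstract can be read off a series of checkpoint evaluations. The following is a minimal sketch, not the authors' measurement code: the function name `stability_gap` and the example scores are hypothetical, and it assumes one evaluation score per checkpoint with the first entry taken before continual pre-training starts.

```python
from typing import Optional, Sequence, Tuple


def stability_gap(scores: Sequence[float]) -> Tuple[float, int, Optional[int]]:
    """Summarize a drop-then-recover curve of checkpoint scores.

    Returns (drop, trough_index, recovery_index):
      drop           - how far the lowest score falls below the starting score
                       (0.0 if the score never falls below the start)
      trough_index   - index of the lowest score (first one on ties)
      recovery_index - first index after the trough where the score is back at
                       or above the starting score, or None if that never happens
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    start = scores[0]
    trough_index = min(range(len(scores)), key=lambda i: scores[i])
    drop = max(0.0, start - scores[trough_index])
    recovery_index = next(
        (i for i in range(trough_index + 1, len(scores)) if scores[i] >= start),
        None,
    )
    return drop, trough_index, recovery_index


# Illustrative, made-up accuracy values (NOT results from the paper).
curve = [36.0, 34.5, 33.0, 35.0, 36.5, 38.0]
drop, trough, recovered = stability_gap(curve)
print(f"drop={drop:.1f}, trough at checkpoint {trough}, recovered at {recovered}")
```

A shorter distance between the trough and the recovery checkpoint would correspond to the faster recovery that the three strategies aim for.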

What problem does this paper attempt to address?

The paper addresses the stability gap that occurs during continual pre-training: the model's performance temporarily drops at the beginning and then gradually recovers. This phenomenon makes domain performance improve inefficiently and causes forgetting of general task knowledge. To tackle the problem, the paper proposes three strategies:

1. **Multi-Epoch Pre-training on a Subset**: Instead of pre-training for a single epoch on a large corpus, the model is pre-trained for multiple epochs on an appropriately sized subset, which accelerates performance recovery.
2. **High-Quality Subset Pre-training**: Pre-training is conducted only on a high-quality sub-corpus to quickly enhance domain performance.
3. **Data Mixture Similar to the Pre-training Data**: Using a data mixture similar to the pre-training data reduces the distribution gap and stabilizes the model's instruction-following ability during continual pre-training.

The paper validates these strategies in continual pre-training and instruction tuning in the medical domain, where they improve the model's average performance on medical tasks. For example, experiments on the OpenLlama-3B model show that the strategies raise the average medical task performance from 36.2% to 40.7% while using only 40% of the original training budget, without causing forgetting of general task capabilities. The strategies were also applied to the continual pre-training and instruction tuning of the Llama-3-8B model. The resulting model, Llama-3-Physician, outperformed other open-source models on medical benchmarks and performed comparably to or better than GPT-4 on several of them.
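The three strategies above can be pictured as a small data-planning step under a fixed token budget. This is a rough sketch under stated assumptions, not the paper's pipeline: the `Doc` type, its `quality` score, and the helper names are all hypothetical, and the mixing step is a simplification that blends by document count with a general-domain pool rather than reproducing any particular pre-training mixture.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    text: str
    tokens: int
    quality: float  # hypothetical quality score; higher is better


def select_high_quality(docs: List[Doc], subset_tokens: int) -> List[Doc]:
    """Strategy 2 (sketch): greedily keep the best-scored docs that fit the subset size."""
    chosen, used = [], 0
    for doc in sorted(docs, key=lambda d: d.quality, reverse=True):
        if used + doc.tokens > subset_tokens:
            continue  # skip docs that would overflow the subset
        chosen.append(doc)
        used += doc.tokens
    return chosen


def epochs_within_budget(token_budget: int, subset: List[Doc]) -> int:
    """Strategy 1 (sketch): number of full passes over the subset the budget allows."""
    subset_tokens = sum(d.tokens for d in subset)
    if subset_tokens == 0:
        raise ValueError("subset is empty")
    return token_budget // subset_tokens


def mix_with_general(domain: List[Doc], general: List[Doc],
                     general_fraction: float, seed: int = 0) -> List[Doc]:
    """Strategy 3 (simplified): add general docs so they form roughly
    `general_fraction` of the documents in the final mixture."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    sampled = rng.sample(general, min(n_general, len(general)))
    mixture = domain + sampled
    rng.shuffle(mixture)
    return mixture


# Toy usage with made-up documents and budgets.
domain_docs = [Doc("a", 100, 0.9), Doc("b", 100, 0.2), Doc("c", 100, 0.8), Doc("d", 100, 0.5)]
general_docs = [Doc(f"g{i}", 100, 0.5) for i in range(10)]
subset = select_high_quality(domain_docs, subset_tokens=200)
epochs = epochs_within_budget(token_budget=1000, subset=subset)
mixture = mix_with_general(subset, general_docs, general_fraction=0.5)
print(len(subset), epochs, len(mixture))
```

The subset size, quality threshold, and mixing fraction are the knobs here; the summary does not state the values used in the paper, so the numbers in the toy usage are placeholders.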

Efficient Continual Pre-training by Mitigating the Stability Gap

Towards Effective and Efficient Continual Pre-training of Large Language Models

Investigating Continual Pretraining in Large Language Models: Insights and Implications

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora

Recyclable Tuning for Continual Pre-training

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Continual Learning of Large Language Models: A Comprehensive Survey

Continual Pre-Training of Large Language Models: How to (re)warm your model?

Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs

Continuous Training and Fine-tuning for Domain-Specific Language Models in Medical Question Answering

SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model

Pretrained Language Model in Continual Learning: A Comparative Study

Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

Efficient Continual Pre-training of LLMs for Low-resource Languages

LLaCA: Multimodal Large Language Continual Assistant

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Can LLMs' Tuning Methods Work in Medical Multimodal Domain?