Abstract: Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measure the model's performance throughout the continual pre-training process. We observe a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) continually pre-training the LLM on a properly sized subset for multiple epochs, which yields faster performance recovery than pre-training on a large corpus for a single epoch; (2) pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.
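As a rough illustration of how the three strategies could fit together under a fixed token budget, the sketch below builds a multi-epoch schedule from a high-quality domain subset mixed with general-domain data. It is not the authors' implementation: the `Doc` type, the `quality` score, the `build_schedule` function, and the default values for `epochs` and `general_ratio` are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    n_tokens: int
    quality: float  # hypothetical quality score; higher means better


def build_schedule(domain_docs, general_docs, token_budget,
                   epochs=4, general_ratio=0.2, seed=0):
    """Return one shuffled list of Docs per epoch (illustrative only)."""
    rng = random.Random(seed)

    # Strategy 1: split the fixed budget across several epochs, so each
    # epoch revisits the same, smaller subset.
    per_epoch = token_budget // epochs
    general_tokens = int(per_epoch * general_ratio)
    domain_tokens = per_epoch - general_tokens

    # Strategy 2: fill the domain share with the highest-quality documents.
    subset, used = [], 0
    for doc in sorted(domain_docs, key=lambda d: d.quality, reverse=True):
        if used + doc.n_tokens <= domain_tokens:
            subset.append(doc)
            used += doc.n_tokens

    # Strategy 3: mix in general-domain data (assumed to resemble the
    # original pre-training distribution) to narrow the distribution gap.
    pool = list(general_docs)
    rng.shuffle(pool)
    replay, used = [], 0
    for doc in pool:
        if used + doc.n_tokens <= general_tokens:
            replay.append(doc)
            used += doc.n_tokens

    # Every epoch sees the same documents in a fresh random order.
    schedule = []
    for _ in range(epochs):
        epoch = subset + replay
        rng.shuffle(epoch)
        schedule.append(epoch)
    return schedule
```

In practice the subset size, number of epochs, and mixing ratio would need to be tuned for the model and corpus at hand; the values here are placeholders.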

Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation

Towards Effective and Efficient Continual Pre-training of Large Language Models

Efficient Continual Pre-training by Mitigating the Stability Gap

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

On Domain-Specific Post-Training for Multimodal Large Language Models

Does your data spark joy? Performance gains from domain upsampling at the end of training

Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali

Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

CatMemo at the FinLLM Challenge Task: Fine-Tuning Large Language Models using Data Fusion in Financial Applications

The Construction of Instruction-tuned LLMs for Finance without Instruction Data Using Continual Pretraining and Model Merging

Experience of Training a 1.7B-Parameter LLaMa Model From Scratch

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models

MAML-en-LLM: Model Agnostic Meta-Training of LLMs for Improved In-Context Learning

Investigating Continual Pretraining in Large Language Models: Insights and Implications

Continual Post-Training of Language Models

Enhancing Financial Domain Adaptation of Language Models via Model Augmentation

Continuous Training and Fine-tuning for Domain-Specific Language Models in Medical Question Answering

Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories

DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining

SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation