Abstract: Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields because domain-specific or proprietary corpora are limited. Continual pre-training (CPT) enhances LLM capabilities by injecting new domain-specific or proprietary knowledge while replaying general corpus to prevent catastrophic forgetting. The mixture ratio of general corpus to domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we revisit the scaling behavior of LLMs under CPT and discover a power-law relationship between loss, mixture ratio, and training token scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking this balance, CMR maintains the model's general ability and achieves the desired domain transfer, ensuring the highest utilization of available resources. Considering the balance between efficiency and effectiveness, CMR can be regarded as the optimal mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, propose the CMR scaling law, and substantiate its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.
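To make the power-law claim in the abstract concrete, the following is a minimal sketch of fitting a pure power law, loss ≈ a · T^(-b), to loss measured at several training-token counts T for one fixed mixture ratio. The exact functional form used in the paper is not given on this page, so the form, the function name `fit_power_law`, and the synthetic numbers are all illustrative assumptions rather than the authors' method.

```python
import math

def fit_power_law(tokens, losses):
    """Fit loss ~= a * tokens**(-b) by ordinary least squares in log-log space.

    Assumption: a pure power law with no irreducible-loss offset. This is a
    generic illustration, not the paper's fitted form.
    """
    xs = [math.log(t) for t in tokens]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of log(loss) vs log(tokens)
    intercept = mean_y - slope * mean_x    # log(a)
    return math.exp(intercept), -slope     # (a, b)

def predict_loss(tokens, a, b):
    """Extrapolate the fitted curve to a new token count."""
    return a * tokens ** (-b)

# Synthetic, noise-free data (made-up values, not results from the paper).
tokens = [1e9, 2e9, 5e9, 1e10]
losses = [20.0 * t ** (-0.1) for t in tokens]
a, b = fit_power_law(tokens, losses)
```

Repeating such a fit for each mixture ratio would give one curve per ratio, which is the kind of input a mixture-ratio prediction needs. With real, noisy measurements a nonlinear fit that includes an offset term may be more appropriate.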

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to determine the optimal data mixing ratio during Continual Pre-Training (CPT) so as to balance the losses on general-purpose data and domain-specific data, improving performance in specific domains without sacrificing the model's general-purpose performance. Specifically, the paper focuses on the following points:

1. **Existence of the Optimal Data Mixing Ratio**: The paper explores whether there is an optimal data mixing ratio (Critical Mixture Ratio, CMR) for a given model size and number of training tokens, such that the model can effectively adapt to a specific domain while maintaining its general-purpose ability.
2. **How CMR Varies with Model Size and Number of Training Tokens**: The paper studies the trend of CMR as the model size and the number of training tokens change, and proposes a method for predicting CMR, namely the CMR scaling law.
3. **Predictability of CMR**: The paper verifies the predictability of CMR through a large number of experiments and proposes a method for predicting CMR based on the number of training tokens.

### Main Contributions

1. **Formalizing the Trade-off in CPT**: The concept of a feasible mixing ratio is introduced, and the balance point between maintaining general-purpose ability and enhancing domain-specific ability in the CPT process, namely CMR, is defined.
2. **Predictability of CMR**: Experiments show a power-law relationship between the loss, the data mixing ratio, and the number of training tokens. The CMR scaling law is proposed to predict the optimal mixing ratio.
3. **Importance of the CMR Scaling Law**: The CMR scaling law matters for efficient domain transfer. It can be used to determine the most effective training configuration with limited data and computing resources.

### Key Results

1. **Trade-off Relationship**: During the CPT process, there is a trade-off between the general-purpose loss and the domain-specific loss. The general-purpose loss gradually decreases after an initial increase, while the domain-specific loss continues to decrease. This relationship follows a power-law form and can be used to predict losses under different mixing ratios.
2. **Prediction of CMR**: Through loss prediction, the optimal mixing ratio can be predicted using the CMR scaling law. Experimental results show that CMR increases as the model size increases and is related to the distribution gap between the target domain and the general-purpose domain.
3. **Visualization Results**: Figure 1 shows the training curves under different model sizes and mixing ratios. The yellow dotted line marks the curves that meet the CPT objectives, indicating that feasible mixing ratios exist.

### Experimental Setup

1. **Data Preparation**: The general-purpose pre-training dataset contains Chinese, English, and code corpora, with a total of 220 billion tokens. The domain-specific datasets include finance and academic papers, and each dataset contains at least 20 billion tokens.
2. **Model Architecture**: The language model architecture is the same as that of the Llama series, with the number of parameters ranging from 460 million to 3.1 billion.
3. **Training Process**: First, pre-training is carried out on a general-purpose dataset of 200 billion tokens, and then CPT is carried out on a mixture of a 20-billion-token general-purpose dataset and a domain-specific dataset.

### Conclusion

The paper verifies the existence and predictability of CMR through a large number of experiments and proposes the CMR scaling law. These findings provide practical guidance for optimizing the training of large language models in specific domains, ensuring that performance in specific domains is efficiently improved while general-purpose performance is maintained.
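The idea of a feasible mixing ratio and a critical one can be sketched in a few lines. This is a hedged illustration only: it assumes the mixture ratio is the share of domain data, that a ratio is "feasible" when the general-purpose loss after CPT stays within a chosen tolerance of its pre-CPT value, and that the CMR is the largest feasible ratio. The summary above does not spell out these definitions, and the function name, tolerance, and loss values are hypothetical.

```python
from typing import Dict, Optional

def critical_mixture_ratio(
    general_loss_by_ratio: Dict[float, float],
    baseline_general_loss: float,
    tolerance: float,
) -> Optional[float]:
    """Return the largest domain-data share whose general loss stays feasible.

    Assumption: "feasible" means the general-purpose loss after CPT exceeds the
    pre-CPT baseline by at most `tolerance`. Returns None if no ratio qualifies.
    """
    feasible = [
        ratio
        for ratio, loss in general_loss_by_ratio.items()
        if loss - baseline_general_loss <= tolerance
    ]
    return max(feasible) if feasible else None

# Made-up general-purpose losses after CPT, keyed by domain-data share.
measurements = {0.1: 2.50, 0.2: 2.51, 0.3: 2.53, 0.5: 2.58}
cmr = critical_mixture_ratio(measurements, baseline_general_loss=2.50, tolerance=0.02)
```

In practice the losses would come from predicted curves at the target token budget rather than from a handful of measured runs, which is where a scaling-law fit would save compute.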

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment

BiMix: Bivariate Data Mixing Law for Language Model Pretraining

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Performance Law of Large Language Models

RegMix: Data Mixture as Regression for Language Model Pre-training

Scaling Law for Language Models Training Considering Batch Size

Scaling Laws for Mixed quantization in Large Language Models

Towards Effective and Efficient Continual Pre-training of Large Language Models

Cross-model Control: Improving Multiple Large Language Models in One-time Training

Efficient Continual Pre-training by Mitigating the Stability Gap

Scaling Laws for Predicting Downstream Performance in LLMs

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral

Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models

Temporal Scaling Law for Large Language Models