BiMix: Bivariate Data Mixing Law for Language Model Pretraining

Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
2024-10-15
Abstract: Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures ($R^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses a gap in understanding how the composition of pretraining data affects the performance of large language models (LLMs). Although LLMs perform well across many tasks, the joint effect of domain proportions and data volume on model performance has not been studied systematically or in depth. The paper therefore introduces BiMix, a bivariate data mixing law that models and optimizes the joint scaling behavior of domain proportions and data volume in multi-source datasets.

### Main Issues

1. **Understanding the Impact of Data Mixing**: Current pretraining data mixing strategies rely mainly on heuristics or computationally expensive optimization techniques. There is no general framework for understanding and predicting the scaling behavior of mixed-domain training.
2. **Optimizing Data Mixing**: Existing methods need substantial computational resources to optimize domain proportions and lack a systematic theoretical foundation to guide efficient data mixing.
3. **Predicting Model Performance**: There is no effective method for predicting model performance under different data mixtures, which limits efficient resource allocation and performance optimization.

### Solution

- **BiMix Model**: BiMix relates domain proportions and total training volume to model performance through a mathematical formulation, which can be used to predict and optimize training outcomes.
- **Experimental Validation**: Extensive experiments on two large-scale datasets show that BiMix is accurate in loss extrapolation (mean relative error < 0.2%) and generalizes to unseen mixtures (R² > 0.97).
- **Entropy-Based Metric**: The paper establishes entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight way to choose mixtures.

### Contributions

1. **Mathematical Formulation of Mixing Law**: Jointly models the scaling behavior of domain proportions and total training volume, with good interpretability and functional extensibility.
2. **Extensive Experimental Validation**: Demonstrates the effectiveness of BiMix in predicting and optimizing model performance across various datasets and training scenarios.
3. **Empirical Support for Entropy-Based Metric**: Provides empirical evidence for using entropy-based measures as a lightweight mixing proxy, offering new insights for efficient data mixing optimization.

Together, these contributions address gaps in existing research and provide theoretical and practical tools for improving the efficiency of LLM training and developing more effective scaling strategies.
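
The summary above names two evaluation metrics (mean relative error and R²) and an entropy-based proxy without showing how such quantities are computed. The following is a minimal sketch, not the authors' implementation. The two metrics use their standard definitions. The entropy part is an assumption made purely for illustration: it sets each domain's mixing weight proportional to the Shannon entropy of that domain's empirical token distribution, which may differ from the entropy measures the paper actually uses. All function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_proportions(domains):
    """Illustrative proxy (an assumption, not the paper's exact method):
    mixing weights proportional to each domain's token entropy."""
    entropies = {name: shannon_entropy(toks) for name, toks in domains.items()}
    z = sum(entropies.values())
    if z == 0:
        # Every domain has zero entropy: fall back to a uniform mixture.
        return {name: 1.0 / len(domains) for name in domains}
    return {name: h / z for name, h in entropies.items()}


def mean_relative_error(y_true, y_pred):
    """Mean of |prediction - target| / |target| over all points."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy token lists standing in for tokenized domain corpora.
    domains = {"web": ["a", "b", "c", "d"], "code": ["x", "y"]}
    print(entropy_proportions(domains))

    # Toy observed vs. predicted validation losses.
    observed = [2.0, 4.0]
    predicted = [2.2, 3.6]
    print(mean_relative_error(observed, predicted), r_squared(observed, predicted))
```

In practice the entropy would be estimated on a sample of each domain's tokenized corpus, and the metrics would compare losses predicted by a fitted mixing law against losses measured from actual training runs.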