Abstract: Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures ($R^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
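The two accuracy figures quoted in the abstract, mean relative error and $R^{2}$, are standard goodness-of-fit measures. The sketch below shows how such figures are typically computed from observed and predicted loss values. The function names and the sample numbers are illustrative assumptions and are not taken from the paper.

```python
def mean_relative_error(actual, predicted):
    """Average of |predicted - actual| / |actual| over all pairs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. predicted validation losses (made-up numbers).
observed = [1.0, 2.0, 3.0]
fitted = [1.1, 2.0, 2.9]
mre = mean_relative_error(observed, fitted)
r2 = r_squared(observed, fitted)
```

A mean relative error below 0.2% corresponds to `mre < 0.002` under this definition.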

What problem does this paper attempt to address?

The paper addresses the limited understanding of how the composition of pretraining data for large language models (LLMs) affects model performance. Although LLMs perform well across many tasks, existing research on how the proportions and amounts of data from different domains jointly affect performance is neither systematic nor thorough. The paper therefore introduces BiMix, a bivariate data mixing law that models and optimizes the joint scaling behavior of domain proportions and data volume in multi-source datasets.

### Main Issues

1. **Understanding the Impact of Data Mixing**: Current pretraining data mixing strategies rely mainly on heuristics or computationally expensive optimization techniques, and there is no general framework for understanding and predicting the scaling behavior of mixed-domain training.
2. **Optimizing Data Mixing**: Existing methods require substantial computational resources to optimize domain proportions and lack a systematic theoretical foundation to guide efficient data mixing.
3. **Predicting Model Performance**: There is no effective method for predicting model performance under different data mixtures, which limits efficient resource allocation and performance optimization.

### Solution

- **BiMix Model**: BiMix expresses model performance as a function of two variables, domain proportions and total training volume, so that training outcomes can be predicted and optimized.
- **Experimental Validation**: Extensive experiments on two large-scale datasets show BiMix's high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures (R² > 0.97).
- **Entropy-Based Metric**: The paper introduces entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight way to choose domain proportions.

### Contributions

1. **Mathematical Formulation of Mixing Law**: Jointly models the scaling behavior of domain proportions and total training volume, with good interpretability and functional extensibility.
2. **Extensive Experimental Validation**: Demonstrates the effectiveness of BiMix in predicting and optimizing model performance across the datasets and training scenarios tested.
3. **Empirical Support for Entropy-Based Metric**: Provides empirical evidence for using entropy-based measures as lightweight mixing proxies, offering new insights for efficient data mixing optimization.

Through these contributions, BiMix addresses gaps in existing research and provides theoretical and practical tools for improving the efficiency of LLM training and developing more effective scaling strategies.
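To make the idea of an entropy-based mixing proxy concrete, the following minimal sketch computes a Shannon entropy per domain and normalizes the values into mixing proportions. The summary does not specify which entropy measure the paper uses or how it is mapped to proportions, so both the unigram token entropy and the simple normalization here are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Unigram Shannon entropy (in bits) of a token sequence."""
    if not tokens:
        raise ValueError("token sequence must be non-empty")
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_proportions(domain_tokens):
    """Map each domain to a proportion given by its share of total entropy.

    Assumption: higher-entropy domains receive proportionally more weight.
    This normalization is illustrative, not the paper's stated procedure.
    """
    entropies = {name: shannon_entropy(toks) for name, toks in domain_tokens.items()}
    total = sum(entropies.values())
    if total == 0:
        # Every domain has zero entropy: fall back to a uniform mixture.
        return {name: 1.0 / len(entropies) for name in entropies}
    return {name: h / total for name, h in entropies.items()}


# Toy corpora (made-up): "code" has 2 equally likely tokens, "web" has 4.
toy_domains = {
    "code": ["x", "y"],
    "web": ["a", "b", "c", "d"],
}
proportions = entropy_proportions(toy_domains)
```

Because a proxy of this kind needs only token statistics and no model training, it is cheap to compute, which is the sense in which the summary calls the strategy lightweight.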

BiMix: Bivariate Data Mixing Law for Language Model Pretraining

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

RegMix: Data Mixture as Regression for Language Model Pre-training

Efficient Online Data Mixing For Language Model Pre-Training

Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

A Data Cartography based MixUp for Pre-trained Language Models

Aioli: A Unified Optimization Framework for Language Model Data Mixing

Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral

AutoMix: Automatically Mixing Language Models

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Harnessing Hard Mixed Samples with Decoupled Regularizer

MixMix: All You Need for Data-Free Compression Are Feature and Data Mixing

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Decoupled Mixup for Data-efficient Learning

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Code-mixed LLM: Improve Large Language Models' Capability to Handle Code-Mixing through Reinforcement Learning from AI Feedback

Bayesian DivideMix++ for enhanced learning with noisy labels