AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia
2024-10-13
Abstract: Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of language model pre-training. This paper demonstrates that the optimal composition of training data from different domains is scale-dependent, challenging the existing practice of determining optimal mixtures through small-scale experiments and directly applying them at larger scales. We derive an analytical model for the dependence of optimal weights on data scale and introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales. *AutoScale* first uses a principled optimization framework to find optimal compositions at smaller, feasible scales, then predicts optimal compositions at larger scales using our derived model. Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance. Particularly, for GPT-2 Large on RedPajama, *AutoScale* decreases validation perplexity 28% faster than baselines, with up to 38% speed-up over unweighted training, achieving the best performance across downstream tasks. This work provides insights into the varying benefits of data sources across training scales for language models, contributing to the burgeoning research on scale-dependent data curation. Code is open-sourced.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The problem this paper addresses is how to automatically predict the optimal data composition at different training data scales when pre-training large language models (LLMs). Existing methods usually determine the optimal data composition through small-scale experiments and apply it directly to larger-scale training. This practice assumes that the optimal composition stays the same across scales, which may not hold and can lead to suboptimal performance at large scale. The paper challenges this assumption, examines whether the optimal composition does differ across scales, and proposes a new method, AutoScale, to address the problem.

### Main contributions:

1. **Principled algorithmic framework**: The researchers formulate optimal domain mixing as a bi-level optimization problem and develop a novel method for estimating how the model loss depends on the domain weights, which reduces the bi-level problem to a single-level one. This method requires only a linear number of model retrainings, making it feasible for exploratory research.
2. **Revealing and quantifying the scale dependence of the optimal domain composition**: Through empirical study, the researchers found that the optimal data composition does change with the training data scale. They further derived an analytical framework that models the optimal data composition as a function of the training data scale.
3. **Practical optimal domain mixing algorithm**: Although the approaches above are feasible at small scale, their practicality at large scale is limited. For this reason, the researchers proposed AutoScale.
This method first finds the optimal data composition at a smaller, computationally feasible scale, and then uses the derived scale-dependence model to predict the optimal composition at a larger scale.
4. **Robust performance improvement across models and datasets**: The researchers evaluated AutoScale on multiple models and datasets (GPT-2 Large and BERT pre-training), and the results show that it improves both training efficiency and downstream task performance. For example, when pre-training GPT-2 Large on RedPajama, AutoScale decreases validation perplexity 28% faster than baselines, with up to a 38% speed-up over unweighted training.

### Core methods:

- **Bi-level optimization problem**: The optimal domain mixing problem is formalized as a bi-level optimization problem, where the outer problem searches for the optimal domain weights and the inner problem trains the model on data mixed according to those weights.
- **Scaling law approximation**: Neural scaling laws are used to approximate the relationship between the validation loss and the amount of training data, which reduces the bi-level problem to a single-level one.
- **AutoScale algorithm**: The optimal data composition is found at a smaller scale, and the derived scale-dependence model is then used to predict the optimal composition at a larger scale, so that the composition of large-scale training data can be optimized without searching at that scale.

### Experimental results:

- **Faster decline in validation perplexity**: On GPT-2 Large, AutoScale decreases validation perplexity 28% faster than baselines, with up to a 38% speed-up over unweighted training.
- **Performance improvement in downstream tasks**: AutoScale achieves the best overall performance across multiple downstream tasks.
- **Changing benefits of data sources**: The study found that data sources traditionally considered high-quality (such as Wikipedia and scientific papers) are most beneficial at small training scales, but their benefits diminish rapidly at larger scales. In contrast, data sources containing diverse examples (such as CommonCrawl) continue to provide benefits at large scales.

In conclusion, through theoretical analysis and empirical study, this paper demonstrates that the optimal data composition changes with the training data scale, and it proposes a practical method for automatically predicting the optimal data composition for large-scale training.
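
To make the scale-dependence claim concrete, here is a minimal toy sketch in Python. It is not the paper's model or the AutoScale algorithm. It assumes a hypothetical separable power-law loss, `sum_i b[i] * n_i ** (-g[i])`, where `n_i` is the amount of data taken from domain `i`; the coefficients `b` and `g` and the function name `optimal_weights` are made up for illustration. Under that assumption, the best split of a fixed data budget follows from the Lagrangian stationarity condition `b[i] * g[i] * n_i ** (-(g[i] + 1)) = lam`, and the multiplier `lam` can be found by bisection.

```python
import math

def optimal_weights(b, g, budget, iters=200):
    """Toy model (assumed, not from the paper): minimize
    sum_i b[i] * n_i ** (-g[i]) subject to sum_i n_i = budget, n_i > 0.
    Returns the optimal domain weights n_i / budget."""

    def amounts(log_lam):
        # Stationarity: b_i * g_i * n_i ** (-(g_i + 1)) = lam
        #   =>  n_i = (b_i * g_i / lam) ** (1 / (g_i + 1)), computed in log space.
        return [math.exp((math.log(bi * gi) - log_lam) / (gi + 1.0))
                for bi, gi in zip(b, g)]

    # The total allocation decreases monotonically in lam, so bisect on log(lam).
    lo, hi = -600.0, 600.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(amounts(mid)) > budget:
            lo = mid  # allocation exceeds the budget: lam must be larger
        else:
            hi = mid
    n = amounts(0.5 * (lo + hi))
    total = sum(n)
    return [ni / total for ni in n]

# Hypothetical coefficients: domain 0 saturates quickly (large exponent),
# domain 1 keeps improving slowly (small exponent).
b = [1.0, 1.0]
g = [0.6, 0.2]
for budget in (1e3, 1e6, 1e9):
    weights = optimal_weights(b, g, budget)
    print(f"budget={budget:.0e}  weights={[round(w, 3) for w in weights]}")
```

In this toy setting the weight of the fast-saturating domain shrinks as the budget grows, which mirrors the qualitative finding summarized above. Whenever the exponents differ across domains, a mixture tuned at a small budget is no longer optimal at a larger one.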