Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
2024-06-15
Abstract: Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$\times$ over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300$\times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20$\times$ less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses two gaps between existing scaling studies and how language models are actually trained and evaluated:

1. **Predictability of Over-training**:
   - Scaling laws are typically studied in the compute-optimal training regime (the "Chinchilla optimal" regime). In practice, however, models are often over-trained to reduce inference costs.
   - The authors created a testbed of 104 models with 0.011B to 6.9B parameters, trained with various numbers of tokens on three data distributions, to study the effects of over-training.
   - They found that the validation loss of over-trained models can still be reliably predicted with scaling laws.

2. **Prediction of Downstream Task Performance**:
   - Most existing scaling laws predict next-token prediction loss, whereas models are usually compared on downstream task performance.
   - The authors propose a power law that links a language model's perplexity to its average top-1 error across multiple downstream tasks, which makes that error predictable.

### Main Contributions

1. **Scaling Laws for Over-training**:
   - The authors found experimentally that, even with over-training, validation loss still follows a power law in training compute \( C \): \( L'(C) = \lambda \cdot C^{-\eta} \), where \( \eta \) is the scaling exponent and \( \lambda \) is a coefficient that depends on the training token ratio \( M \).
   - Using these scaling laws, they accurately predicted the C4 validation loss of a 1.4B-parameter model trained on 900B tokens from experiments that take 300 times less compute.

2. **Prediction of Downstream Task Performance**:
   - The authors proposed the relationship \( \text{Err}(L) = \epsilon - k \cdot \exp(-\gamma L) \), which links the model's loss \( L \) to its average top-1 error on downstream tasks. Because perplexity equals \( \exp(L) \), this exponential decay in loss is a power law in perplexity.
   - With this relationship, they accurately predicted the average top-1 error of a 6.9B-parameter model trained on 138B tokens over 17 downstream tasks from experiments that take 20 times less compute.

### Experimental Setup

- **Model Configuration**: The authors selected model configurations with 0.011B to 0.411B parameters through grid search and trained models on three datasets: C4, RedPajama, and RefinedWeb.
- **Training and Validation**: All models were evaluated on the C4 validation set, and downstream evaluations used 17 tasks from LLM-Foundry.
- **Fitting Scaling Laws**: The scaling laws for loss and downstream error were fit with the Levenberg-Marquardt algorithm as implemented in SciPy.

### Results

- **Predictability of Over-training Performance**: The authors predicted the C4 validation loss of a 1.4B-parameter model trained on 900B tokens from experiments that take 300 times less compute, with a relative error of only 0.7%.
- **Predictability of Downstream Task Performance**: The authors predicted the average top-1 error of a 6.9B-parameter model trained on 138B tokens over 17 downstream tasks from experiments that take 20 times less compute, with a relative error of only 0.05%.

### Conclusion

Through extensive experiments, this paper demonstrates that the performance of language models can still be reliably predicted with scaling laws, even when the models are over-trained. The authors also propose a way to link a model's perplexity to its downstream task performance, which makes model development more efficient and more reliable.
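
The "32 times over-trained" and "compute-optimal" labels attached to the two target runs in the abstract can be checked with short arithmetic. The sketch below assumes that the training token ratio \( M \) means training tokens divided by parameters, and that roughly 20 tokens per parameter counts as compute-optimal. Neither assumption is spelled out in the summary above, and the function names are illustrative only.

```python
# Assumption: about 20 tokens per parameter is treated as compute-optimal
# ("Chinchilla optimal"); this constant is not stated in the summary itself.
COMPUTE_OPTIMAL_M = 20.0


def token_multiplier(num_tokens: float, num_params: float) -> float:
    """Return M, the number of training tokens per model parameter."""
    return num_tokens / num_params


def overtraining_factor(num_tokens: float, num_params: float) -> float:
    """Return how many times M exceeds the assumed compute-optimal value."""
    return token_multiplier(num_tokens, num_params) / COMPUTE_OPTIMAL_M


# The two target runs named in the abstract: (training tokens, parameters).
runs = {
    "1.4B params, 900B tokens": (900e9, 1.4e9),
    "6.9B params, 138B tokens": (138e9, 6.9e9),
}
for name, (tokens, params) in runs.items():
    m = token_multiplier(tokens, params)
    factor = overtraining_factor(tokens, params)
    print(f"{name}: M = {m:.1f}, over-training factor = {factor:.1f}x")
```

Under these assumptions the first run comes out at roughly 32 times the compute-optimal token budget and the second at about 1 times, which matches the labels in the abstract.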
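
To illustrate how a law of the form \( L'(C) = \lambda \cdot C^{-\eta} \) might be fit and then extrapolated, here is a minimal sketch using SciPy's `curve_fit` with `method="lm"` (Levenberg-Marquardt). The data are synthetic, and the compute units, noise level, and "true" parameter values are all made up for the example; none of them come from the paper. The sketch also fits only the simplified form quoted in the summary, which may differ from the authors' full parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit


def loss_power_law(compute, lam, eta):
    """Power law in compute: L'(C) = lam * C**(-eta)."""
    return lam * np.power(compute, -eta)


# Synthetic small-scale runs (hypothetical values, not the paper's data).
# Compute is expressed in arbitrary normalized units to keep the fit
# numerically well-conditioned.
compute = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
true_lam, true_eta = 1.5, 0.12
rng = np.random.default_rng(0)
noise = 1.0 + 0.002 * rng.standard_normal(compute.size)
observed_loss = loss_power_law(compute, true_lam, true_eta) * noise

# Levenberg-Marquardt least-squares fit, starting from a rough guess.
(lam_hat, eta_hat), _ = curve_fit(
    loss_power_law, compute, observed_loss, p0=[1.0, 0.1], method="lm"
)

# Extrapolate to a run with 300x the compute of the largest fitted run.
target_compute = 300.0 * compute.max()
predicted = loss_power_law(target_compute, lam_hat, eta_hat)
reference = loss_power_law(target_compute, true_lam, true_eta)
relative_error = abs(predicted - reference) / reference

print(f"fitted lam = {lam_hat:.4f}, fitted eta = {eta_hat:.4f}")
print(f"predicted loss at 300x compute = {predicted:.4f}")
print(f"relative error vs. noise-free law = {relative_error:.4%}")
```

Fitting on cheap runs and evaluating at a much larger compute budget mirrors the extrapolation workflow described in the Results section, although real runs are noisier than this toy data.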
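
The downstream relationship \( \text{Err}(L) = \epsilon - k \cdot \exp(-\gamma L) \) can be read as a power law in perplexity, since perplexity is \( \exp(L) \) and therefore \( \exp(-\gamma L) = \text{PP}^{-\gamma} \). The sketch below checks that equivalence numerically and fits the three parameters on synthetic, noise-free (loss, error) pairs. The parameter values, loss range, and starting guess are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def error_from_loss(loss, eps, k, gamma):
    """Average top-1 error as exponential decay in loss."""
    return eps - k * np.exp(-gamma * loss)


def error_from_perplexity(perplexity, eps, k, gamma):
    """The same law written as a power law in perplexity = exp(loss)."""
    return eps - k * np.power(perplexity, -gamma)


# Hypothetical "true" parameters used only to generate synthetic data.
true_params = (0.85, 2.0, 0.8)
loss = np.linspace(2.5, 4.5, 8)
error = error_from_loss(loss, *true_params)

# Both forms give identical errors for the same underlying loss values.
same = np.allclose(error, error_from_perplexity(np.exp(loss), *true_params))
print("loss form matches perplexity form:", same)

# Fit (eps, k, gamma) with Levenberg-Marquardt from a rough starting guess.
fitted, _ = curve_fit(
    error_from_loss, loss, error, p0=[0.9, 1.5, 0.7], method="lm"
)

# Predict the error of a hypothetical larger model with a lower loss.
target_loss = 2.2
predicted_error = error_from_loss(target_loss, *fitted)
reference_error = error_from_loss(target_loss, *true_params)
print(f"predicted top-1 error at loss {target_loss}: {predicted_error:.4f}")
```

The three parameters are strongly correlated in a fit like this, so with noisy real measurements the individual estimates can be unstable even when the predicted error is accurate.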