Does your data spark joy? Performance gains from domain upsampling at the end of training

Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle
2024-06-06
Abstract: Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities, as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B), a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to find the optimal balance between general web-crawled data (such as CommonCrawl) and domain-specific data in the pretraining data of large language models (LLMs). As pretraining datasets grow, it becomes very expensive to verify experimentally how different data-mixing strategies affect model capabilities, so a major challenge is evaluating and optimizing these strategies without running a full pretraining each time. The paper proposes a simple but effective method, **Domain Upsampling**: increasing the proportion of domain-specific data relative to CommonCrawl during the final stage of training to improve performance on difficult benchmarks. For a 7B model trained on 1 trillion tokens, this yields gains of up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval over the base data mix, and upsampling during the last 10% to 20% of training gives the best tradeoff between general language modeling and the targeted benchmarks. Because only the final phase of training has to be rerun, the same setup can be used to measure the contribution of an individual dataset by removing it during that phase, at roughly an order of magnitude lower cost than a full pretraining run.
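The schedule described above can be illustrated with a minimal sketch. It assumes the switch to the upsampled mix is a single step change at a fixed fraction of training; the domain names, base weights, upsampling factor, and function names below are hypothetical placeholders and are not taken from the paper.

```python
def normalize(weights):
    """Rescale sampling weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def upsample_domains(base_mix, domains, factor):
    """Multiply the weight of the chosen domains by `factor`, then renormalize.

    Renormalizing means the other sources (here, the web scrape) shrink
    proportionally.
    """
    boosted = {
        name: (w * factor if name in domains else w)
        for name, w in base_mix.items()
    }
    return normalize(boosted)


def mixture_at(step, total_steps, base_mix, final_mix, upsample_fraction=0.2):
    """Return the sampling weights to use at a given training step.

    The base mix is used until the last `upsample_fraction` of training,
    after which the upsampled mix takes over.
    """
    switch_step = int(round(total_steps * (1.0 - upsample_fraction)))
    return final_mix if step >= switch_step else base_mix


# Hypothetical weights, chosen only for illustration.
BASE_MIX = normalize(
    {"common_crawl": 0.80, "code": 0.10, "math": 0.05, "reference": 0.05}
)
FINAL_MIX = upsample_domains(BASE_MIX, {"code", "math", "reference"}, factor=3.0)

if __name__ == "__main__":
    for step in (0, 799, 800, 999):
        print(step, mixture_at(step, 1000, BASE_MIX, FINAL_MIX))
```

Under these assumptions, a dataset ablation of the kind the paper describes would amount to setting one domain's weight to zero in the final mix only and rerunning just the last phase from a shared checkpoint.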