Does your data spark joy? Performance gains from domain upsampling at the end of training

Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle

2024-06-06

Abstract: Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B), a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

This paper addresses how to find the optimal balance between general web-crawled data (such as CommonCrawl) and domain-specific data in the pretraining mix of large language models (LLMs). As pretraining datasets grow, it becomes very expensive to verify experimentally how different data-mixing strategies affect the model, so a major challenge for researchers is evaluating and optimizing these strategies without running a full pretraining. The paper proposes a simple but effective method, **Domain Upsampling**: increasing the proportion of domain-specific data in the final stage of training to improve the model's performance on difficult benchmarks. The method significantly improves model performance at a much lower experimental cost than full pretraining. It also lets researchers measure the impact of individual datasets on model capabilities cheaply, by removing a dataset during this final phase, which gives them a new tool for selecting pretraining data for LLMs.
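The idea of switching to an upsampled mix for the final portion of training can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mixture_at`, `sample_domain`), the domain names, and the mixture proportions are all hypothetical and chosen only for illustration. The sketch assumes a step-based schedule that uses a base mix for most of training and then switches to a mix with more weight on domain-specific data for the last `upsample_fraction` of steps.

```python
import random


def mixture_at(step, total_steps, base_mix, upsampled_mix, upsample_fraction=0.2):
    """Return the domain sampling weights to use at a given training step.

    The base mix is used until the final `upsample_fraction` of training,
    after which the upsampled mix takes over.
    """
    if not 0.0 <= upsample_fraction <= 1.0:
        raise ValueError("upsample_fraction must be in [0, 1]")
    upsample_steps = round(total_steps * upsample_fraction)
    switch_step = total_steps - upsample_steps
    return upsampled_mix if step >= switch_step else base_mix


def sample_domain(mix, rng):
    """Pick the domain for the next sample in proportion to the mix weights."""
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical proportions for illustration only; not the paper's data mixes.
BASE_MIX = {"common_crawl": 0.80, "code": 0.10, "math": 0.05, "reference": 0.05}
UPSAMPLED_MIX = {"common_crawl": 0.40, "code": 0.30, "math": 0.15, "reference": 0.15}

if __name__ == "__main__":
    rng = random.Random(0)
    total_steps = 1000
    for step in (0, 799, 800, 999):
        mix = mixture_at(step, total_steps, BASE_MIX, UPSAMPLED_MIX, 0.2)
        print(step, mix["common_crawl"], sample_domain(mix, rng))
```

With `upsample_fraction=0.2` and 1000 steps, the switch happens at step 800. Under the same assumptions, the dataset-removal experiment described above would amount to passing an upsampled mix with one domain omitted and comparing benchmark results, which only requires rerunning the final phase rather than the whole run.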

