Abstract: Pretraining datasets for large language models (LLMs) have grown to trillions of tokens, composed largely of CommonCrawl (CC) web scrapes along with smaller, domain-specific datasets. Understanding the impact of these domain-specific datasets on model capabilities is expensive, because training at large FLOP scales is required to reveal significant changes on difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity of general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique improves results by up to 6.90 percentage points (pp) on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B), a model trained for twice as long. We ablate the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize, at scale, the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool makes it possible to experiment with the impact of different pretraining datasets at scale, at an order of magnitude lower cost than full pretraining runs.
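The core idea, switching to a mix that upsamples domain-specific data for the final portion of training, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the dataset names, the base weights, the upsampling factor, the helper names `upsample_mix` and `mix_at_step`, and the use of a single hard switch between mixes are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def upsample_mix(base_mix: Dict[str, float],
                 domain_names: Set[str],
                 factor: float) -> Dict[str, float]:
    """Scale the weights of the named domain datasets by `factor`,
    then renormalize so the sampling weights sum to 1."""
    scaled = {
        name: weight * (factor if name in domain_names else 1.0)
        for name, weight in base_mix.items()
    }
    total = sum(scaled.values())
    return {name: weight / total for name, weight in scaled.items()}


def mix_at_step(step: int,
                total_steps: int,
                base_mix: Dict[str, float],
                final_mix: Dict[str, float],
                upsample_fraction: float = 0.2) -> Dict[str, float]:
    """Return the sampling mix for a training step.

    The base mix is used until the last `upsample_fraction` of training,
    after which the upsampled mix is used (assumed here to be a hard switch).
    """
    switch_step = total_steps - round(total_steps * upsample_fraction)
    return final_mix if step >= switch_step else base_mix


# Hypothetical base mix: mostly web scrape, small domain-specific sets.
base_mix = {"common_crawl": 0.8, "code": 0.1, "math": 0.1}
final_mix = upsample_mix(base_mix, {"code", "math"}, factor=3.0)

for step in (0, 799, 800, 999):
    print(step, mix_at_step(step, 1000, base_mix, final_mix))
```

With `upsample_fraction` set between 0.1 and 0.2, the schedule matches the range the abstract reports as the best tradeoff. Dropping a dataset from `final_mix` while keeping the rest would mirror the dataset-removal experiments described above.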

Does your data spark joy? Performance gains from domain upsampling at the end of training
