Abstract: Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.

What problem does this paper attempt to address?

The paper examines how pre-training data affects the performance of language models, focusing on data timeliness, domain coverage, and quality and toxicity filtering. The key issues it addresses are:

1. **Impact of Data Timeliness**:
   - The study evaluates the effect of the time gap between the evaluation data and the pre-training data on model performance, finding that this temporal mismatch leads to performance degradation that is not overcome by fine-tuning.
   - The paper also notes that this phenomenon is more pronounced in larger models.
2. **Impact of Quality and Toxicity Filtering**:
   - It explores the impact of different types of document quality filters (e.g., removing low-quality text) and toxicity filters (e.g., removing toxic or offensive content) on model behavior.
   - It finds that although quality filtering reduces the amount of training data, it significantly improves downstream task performance, while also increasing the likelihood of generating toxic content.
   - Conversely, removing toxic data reduces the generation of toxic content but sacrifices some generalization ability.
3. **Impact of Domain Combination**:
   - It analyzes the combined impact of data from different sources (such as books and web data) on model performance, showing that these diverse data sources are generally beneficial.
   - Although book and web data sources may increase the generation of toxic content, they have a positive impact on the overall performance of the model.
4. **Recommendations and Suggestions**:
   - The paper suggests collecting more books and diverse web data in the future to further improve model performance.
   - It argues that there is no "one-size-fits-all" filtering strategy suitable for all situations, so more targeted quality or inverse toxicity filters need to be developed for specific tasks.

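
To make the filtering trade-off concrete, the following is a minimal sketch of threshold-based document filtering. It is not the paper's pipeline: the `Doc` class, the `quality` and `toxicity` scores, the `filter_docs` helper, and the threshold values are all hypothetical, and a real setup would obtain the scores from trained classifiers.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    quality: float   # hypothetical classifier score in [0, 1]; higher means better quality
    toxicity: float  # hypothetical classifier score in [0, 1]; higher means more toxic


def filter_docs(docs, min_quality=0.0, max_toxicity=1.0):
    """Keep only documents that pass both thresholds (assumed scoring scheme)."""
    return [d for d in docs if d.quality >= min_quality and d.toxicity <= max_toxicity]


# Toy corpus with made-up scores, for illustration only.
corpus = [
    Doc("a", quality=0.9, toxicity=0.1),
    Doc("b", quality=0.8, toxicity=0.7),
    Doc("c", quality=0.2, toxicity=0.05),
    Doc("d", quality=0.4, toxicity=0.9),
]

quality_only = filter_docs(corpus, min_quality=0.5)    # keeps "a" and "b"
toxicity_only = filter_docs(corpus, max_toxicity=0.5)  # keeps "a" and "c"
both = filter_docs(corpus, min_quality=0.5, max_toxicity=0.5)  # keeps only "a"

for name, kept in [("quality", quality_only), ("toxicity", toxicity_only), ("both", both)]:
    print(name, [d.text for d in kept], f"{len(kept) / len(corpus):.0%} of corpus retained")
```

In this toy corpus the quality filter retains a fairly toxic document ("b"), while the toxicity filter retains a low-quality one ("c"), and applying both shrinks the corpus the most. This mirrors, in miniature, why a single filtering setting is unlikely to suit every goal.
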
In summary, this study reveals how pre-training data design decisions affect the performance of language models through systematic experiments on a large number of pre-trained models, providing valuable insights and recommendations for model developers.

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Data, Data Everywhere: A Guide for Pretraining Dataset Construction

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Does your data spark joy? Performance gains from domain upsampling at the end of training

Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning

1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data

Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks

Improving Pretraining Data Using Perplexity Correlations

Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

Investigating Data Contamination for Pre-training Language Models

In-context Pretraining: Language Modeling Beyond Document Boundaries

Pre-Training a Language Model Without Human Language

How to Train Data-Efficient LLMs

To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks

Downstream Datasets Make Surprisingly Good Pretraining Corpora

The Role of Pre-training Data in Transfer Learning

Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese