Abstract: The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high-quality pretraining sets.

What problem does this paper attempt to address?

This paper addresses the problem of how to construct high-quality pretraining datasets effectively. Recent language models have shown strong capabilities across multiple evaluation areas, and these capabilities are largely attributed to the multi-trillion-token pretraining datasets they are trained on. Model developers, however, have not made their dataset construction methods public, so there is little open information on how to develop effective pretraining datasets. To close this gap, the paper carries out the first systematic study covering the entire pipeline of pretraining dataset construction. Its main objectives are:

1. **Identify effective methods**: Run ablation experiments on existing techniques to determine which methods yield the largest gains in model accuracy on downstream tasks.
2. **Classify common data sources**: Categorize the most widely used data source, web crawl snapshots, by toxicity, quality, type of speech, and domain.
3. **Use attribute information to improve datasets**: Demonstrate how this attribute information can be used to further refine and improve the quality of pretraining datasets.

The main contribution of the paper is an actionable set of steps that practitioners can follow to construct high-quality pretraining datasets. Specific contributions include:

- Recommending a series of techniques for the data cleaning, selection, and sampling steps, applicable to English, multilingual, and code data.
- Conducting the first large-scale analysis of web crawl data attributes, covering quality, toxicity, type of speech, and domain.
- Demonstrating how attribute information can be used to improve the performance of data sampling and data selection methods.

Through these studies, the paper fills a knowledge gap in pretraining dataset construction and provides practical guidance for the community.
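To make the third objective more concrete, the following is a minimal sketch of how per-document attribute labels might feed into data selection and sampling. It is an illustration only: the `Doc` fields, the toxicity threshold, and the quality weights are hypothetical and are not taken from the paper, which does not specify them in this summary.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    quality: str     # hypothetical label: "low", "medium", or "high"
    toxicity: float  # hypothetical score in [0, 1]
    domain: str      # hypothetical domain label, e.g. "news"


# Assumed sampling weights per quality bucket (illustrative values only).
QUALITY_WEIGHT = {"low": 0.5, "medium": 1.0, "high": 2.0}


def select(docs, max_toxicity=0.5):
    """Data selection: drop documents above an assumed toxicity threshold."""
    return [d for d in docs if d.toxicity <= max_toxicity]


def sampling_weights(docs):
    """Data sampling: weight each document by its quality bucket."""
    return [QUALITY_WEIGHT[d.quality] for d in docs]


def sample(docs, k, seed=0):
    """Filter by toxicity, then draw k documents with quality-based weights."""
    rng = random.Random(seed)
    kept = select(docs)
    return rng.choices(kept, weights=sampling_weights(kept), k=k)


docs = [
    Doc("a", "high", 0.1, "science"),
    Doc("b", "low", 0.9, "news"),
    Doc("c", "medium", 0.2, "news"),
]
batch = sample(docs, k=5)
```

In this sketch, document `b` is filtered out by the toxicity threshold, and document `a` is drawn with twice the weight of document `c`. A real pipeline would obtain the labels from trained classifiers and tune the thresholds and weights through ablations.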

Data, Data Everywhere: A Guide for Pretraining Dataset Construction
