Data, Data Everywhere: A Guide for Pretraining Dataset Construction

Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Bo Liu, Aastha Jhunjhunwala, Zhilin Wang, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
2024-10-20
Abstract: The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high quality pretraining sets.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the question of how to construct high-quality pretraining datasets effectively. Recent language models perform strongly across many evaluation areas, and those capabilities are largely attributed to the multi-trillion-token pretraining datasets they are trained on. Model developers, however, have not disclosed how they build these datasets, so there is little open information on how to develop effective pretraining sets. To close this gap, the paper carries out the first systematic study covering the entire pipeline of pretraining set construction.

The main objectives are:

1. **Identify effective methods**: Run ablations on existing techniques to determine which ones yield the largest gains in model accuracy on downstream evaluations.
2. **Categorize the most common data source**: Categorize web crawl snapshots, the most widely used data source, by toxicity, quality, type of speech, and domain.
3. **Use attribute information to improve datasets**: Show how this attribute information can be used to further refine and improve the quality of a pretraining set.

The main contribution of the paper is an actionable set of steps that practitioners can follow to construct high-quality pretraining datasets. Specific contributions include:

- Recommending techniques for the data cleaning, selection, and sampling steps that apply to English, multilingual, and code data.
- Conducting the first large-scale analysis of web crawl data attributes, covering quality, toxicity, type of speech, and domain.
- Demonstrating how attribute information can improve the performance of data sampling and data selection methods.

Through these studies, the paper helps close the knowledge gap around pretraining dataset construction and gives the community practical guidance.
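
To make the third objective concrete, here is a minimal sketch of how per-document attribute labels could feed both data selection and data sampling. It is an illustration under stated assumptions, not the paper's actual procedure: the field names (`quality`, `toxicity`), the bucket weights, the toxicity threshold, and all function names are hypothetical, and the sampling with replacement is a simplification.

```python
import random
from collections import defaultdict

# Hypothetical values for illustration only; none of them come from the paper.
QUALITY_WEIGHTS = {"high": 3.0, "medium": 1.0, "low": 0.25}
TOXICITY_THRESHOLD = 0.5


def select_documents(docs, toxicity_threshold=TOXICITY_THRESHOLD):
    """Data selection: keep documents whose toxicity score is at or below the threshold."""
    return [d for d in docs if d["toxicity"] <= toxicity_threshold]


def bucket_sampling_probs(docs, weights=QUALITY_WEIGHTS):
    """Data sampling: bucket probability proportional to (weight * bucket size)."""
    counts = defaultdict(int)
    for d in docs:
        counts[d["quality"]] += 1
    mass = {q: weights[q] * n for q, n in counts.items()}
    total = sum(mass.values())
    return {q: m / total for q, m in mass.items()}


def sample_documents(docs, k, seed=0):
    """Draw k documents (with replacement): pick a quality bucket, then a document in it."""
    rng = random.Random(seed)
    probs = bucket_sampling_probs(docs)
    buckets = defaultdict(list)
    for d in docs:
        buckets[d["quality"]].append(d)
    names = list(probs)
    chosen = rng.choices(names, weights=[probs[n] for n in names], k=k)
    return [rng.choice(buckets[b]) for b in chosen]


# Toy corpus with made-up attribute labels.
corpus = [
    {"id": 1, "quality": "high", "toxicity": 0.1},
    {"id": 2, "quality": "medium", "toxicity": 0.2},
    {"id": 3, "quality": "low", "toxicity": 0.3},
    {"id": 4, "quality": "low", "toxicity": 0.4},
    {"id": 5, "quality": "high", "toxicity": 0.9},  # removed by selection
]

kept = select_documents(corpus)
probs = bucket_sampling_probs(kept)
batch = sample_documents(kept, k=5)
```

In this sketch, selection acts as a hard filter while sampling only reweights what remains, so low-quality buckets are down-weighted rather than discarded. Which attributes to filter on and which to use only for reweighting is a design choice that would have to be validated empirically.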