Abstract: Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models past the limits of randomly sampling web data.

What problem does this paper attempt to address?

This paper explores how to improve the pre-training of large language models (LLMs) through better data selection. Specifically, the authors propose a method called D4 (Document De-Duplication and Diversification), which selects pre-training data by removing duplicate documents and increasing the diversity of what remains.

### Research Background and Objectives

With the growth in compute and data in recent years, LLMs are typically trained with a single pass over as many tokens as possible, sampled at random from large-scale web corpora. This approach has consistently improved model performance, but the gains diminish as the data scale increases. Moreover, there has been little research on how data selection affects pre-training and downstream task performance, with most work still relying on simple de-duplication methods such as MinHash.

### Main Contributions

1. **Proposing the D4 Method**: The authors propose a data selection strategy, D4, which combines document de-duplication (SemDeDup) with cluster-based prototype selection (Prototypicality), applied so that clusters driven by duplicated data do not harm model training.
2. **Experimental Results**:
   - Under a fixed compute budget, D4 selects higher-quality data subsets, which speeds up training (about 20% efficiency gains) and improves average downstream accuracy on 16 NLP tasks (by up to 2%) at the 6.7B model scale.
   - When data is limited and training requires multiple passes over the same dataset, repeating data selected by D4 outperforms baseline training on randomly selected new data, whereas repeating randomly selected data performs worse than the baseline.
3. **Efficiency Analysis**: Although D4 incurs additional compute cost for data selection, the overall efficiency gain far exceeds this cost. For example, on a 6.7 billion parameter model, D4 saved a significant number of GPU hours, which suggests the method has practical value.

### Conclusion

With D4, the authors demonstrate that even simple data selection strategies can significantly improve LLM pre-training. The method improves training efficiency and also improves model performance when data is limited, offering new insights for training language models on large-scale web data.
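The two stages that the summary attributes to D4 (semantic de-duplication followed by cluster-based prototype selection) can be illustrated with a small sketch. This is not the authors' implementation: it assumes document embeddings are already available as a NumPy array, uses a tiny hand-written k-means, and assumes that the prototype step discards the points closest to their cluster centroid. The function names, `threshold`, `keep_ratio`, and `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def semantic_dedup(emb, threshold=0.95):
    """Greedy de-duplication: keep a document only if its cosine similarity
    to every previously kept document is below `threshold` (hypothetical value)."""
    emb = normalize(emb)
    kept = []
    for i in range(len(emb)):
        if not kept or np.max(emb[kept] @ emb[i]) < threshold:
            kept.append(i)
    return np.array(kept)


def kmeans(emb, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns the centroids only."""
    rng = np.random.default_rng(seed)
    centroids = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = emb[labels == c]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    return centroids


def drop_prototypical(emb, keep_ratio=0.75, k=2, seed=0):
    """Assumed diversification step: keep the `keep_ratio` fraction of points
    that lie farthest from their nearest centroid."""
    emb = normalize(emb)
    centroids = kmeans(emb, min(k, len(emb)), seed=seed)
    dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
    dist_to_nearest = dists.min(axis=1)
    n_keep = max(1, int(round(keep_ratio * len(emb))))
    farthest_first = np.argsort(-dist_to_nearest)
    return np.sort(farthest_first[:n_keep])


def d4_select(emb, threshold=0.95, keep_ratio=0.75, k=2, seed=0):
    """De-duplicate first, then diversify; returns indices into `emb`."""
    deduped = semantic_dedup(emb, threshold)
    kept_local = drop_prototypical(emb[deduped], keep_ratio, k, seed)
    return deduped[kept_local]


if __name__ == "__main__":
    # Six mutually orthogonal "documents" plus two rescaled copies of the first two.
    emb = np.vstack([np.eye(6), 2.0 * np.eye(6)[:2]])
    print("after de-duplication:", semantic_dedup(emb))
    print("after D4-style selection:", d4_select(emb, keep_ratio=0.5))
```

In this toy input the two rescaled copies have cosine similarity 1 with their originals, so the first stage removes them before clustering. Running de-duplication before clustering is the ordering the summary describes; at realistic scale one would likely replace the pairwise loop and the hand-written k-means with an approximate nearest-neighbor or clustering library.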

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Does your data spark joy? Performance gains from domain upsampling at the end of training

Improving Pretraining Data Using Perplexity Correlations

Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining

How to Train Data-Efficient LLMs

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Efficient Online Data Mixing For Language Model Pre-Training

Continual Pre-Training of Large Language Models: How to (re)warm your model?

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Enhancing Discriminative Tasks by Guiding the Pre-trained Language Model with Large Language Model's Experience

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity