D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos
2023-08-24
Abstract: Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models past the limits of randomly sampling web data.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper explores how better data selection can improve the pre-training of large language models (LLMs). The authors propose D4 (Document De-Duplication and Diversification), which removes duplicated documents and then increases the diversity of the data that remains.

### Research Background and Objectives

As compute and data budgets have grown in recent years, LLMs have typically been trained with a single pass over as many tokens as possible, sampled at random from large-scale web corpora. This approach keeps improving model performance, but the marginal gains shrink as the data scale grows. There has also been relatively little research on how data selection affects pre-training and downstream task performance; most work still relies on simple de-duplication methods such as MinHash.

### Main Contributions

1. **Proposing the D4 Method**: D4 is a data selection strategy that combines semantic document de-duplication (SemDeDup) with cluster-based prototype selection (Prototypicality). De-duplicating first avoids the negative impact of duplicate-driven clusters on the selection step and on model training.
2. **Experimental Results**:
   - Under a fixed compute budget, D4 selects a higher-quality data subset, which speeds up training (about 20% efficiency gains) and improves average downstream accuracy on 16 NLP tasks (by up to 2%) at the 6.7B model scale.
   - When data is limited and the model must make multiple passes over the same data, repeating data selected with D4 consistently outperforms baseline training on randomly selected new data, whereas repeating randomly selected data performs worse than the baseline.
3. **Efficiency Analysis**: D4 requires additional compute for data selection, but the overall efficiency improvement it brings far exceeds this cost. For example, on a 6.7-billion-parameter model, D4 saved a significant number of GPU hours, which indicates that the method has practical value.

### Conclusion

With D4, the authors demonstrate that even relatively simple data selection strategies can significantly improve LLM pre-training. The method makes training more efficient and also improves model performance when data is limited, which offers new insights for training future language models on large-scale web data.
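
The two-stage selection described under Main Contributions (semantic de-duplication followed by cluster-based prototype selection) can be illustrated with a small sketch. This is not the authors' implementation. The function names (`semdedup`, `prototype_filter`, `d4_select`), the similarity threshold, the keep fraction, and the toy 2-D vectors standing in for pre-trained model embeddings are all hypothetical. The sketch also assumes that the prototype step discards the points closest to their cluster centroid and keeps the less typical ones, a detail the summary above does not spell out.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity for vectors that are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda c: cosine(p, centroids[c]))
            for p in points]


def kmeans(points, k, iters=20, seed=0):
    """Small spherical k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it was
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return centroids, assign(points, centroids)


def semdedup(points, k, threshold):
    """Stage 1 (simplified): greedy near-duplicate removal inside each cluster.

    A point is kept only if its cosine similarity to every point already
    kept in the same cluster is at most `threshold`.
    """
    _, labels = kmeans(points, k)
    kept = []
    for c in range(k):
        kept_here = []
        for i, lab in enumerate(labels):
            if lab == c and all(cosine(points[i], points[j]) <= threshold
                                for j in kept_here):
                kept_here.append(i)
        kept.extend(kept_here)
    return sorted(kept)


def prototype_filter(points, k, keep_fraction):
    """Stage 2 (assumed behavior): keep the points least similar to their centroid."""
    centroids, labels = kmeans(points, k)
    sims = sorted((cosine(p, centroids[lab]), i)
                  for i, (p, lab) in enumerate(zip(points, labels)))
    n_keep = max(1, round(keep_fraction * len(points)))
    return sorted(i for _, i in sims[:n_keep])


def d4_select(embeddings, k=2, dedup_threshold=0.95, keep_fraction=0.5):
    """De-duplicate, re-cluster the survivors, then apply the prototype filter."""
    points = [normalize(e) for e in embeddings]
    dedup_idx = semdedup(points, k, dedup_threshold)
    survivors = [points[i] for i in dedup_idx]
    proto_idx = prototype_filter(survivors, min(k, len(survivors)), keep_fraction)
    return [dedup_idx[i] for i in proto_idx]  # indices into the original list


# Toy 2-D "embeddings"; documents 0/1 and 3/4 are exact duplicates.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.99, 0.01], [0.0, 1.0],
              [0.0, 1.0], [0.5, 0.5], [0.7, 0.3], [0.2, 0.8]]
selected = d4_select(embeddings)
print(selected)  # indices of the documents kept for training
```

In the data-limited setting described above, the indices returned by `d4_select` would define the subset that is repeated across epochs in place of a random subset. A real pipeline would use embeddings from a pre-trained model and a scalable clustering library in place of the pure-Python k-means used here.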