Abstract: Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the pretraining corpus is still empirical and may deviate from the optimal. To address this issue, we systematically analyze the impact of 48 datasets from 5 major categories of pretraining data of LLMs and measure their impacts on LLMs using benchmarks about nine major categories of model capabilities. Our analyses provide empirical results about the contribution of multiple corpora on the performances of LLMs, along with their joint impact patterns, including complementary, orthogonal, and correlational relationships. We also identify a set of "high-impact data" such as Books that is significantly related to a set of model capabilities. These findings provide insights into the organization of data to support more efficient pretraining of LLMs.

What problem does this paper attempt to address?

This paper aims to **quantify and understand how data from different sources and types in the pre-training corpora of large language models (LLMs) affects model performance**. Specifically, it addresses the following problems:

1. **Transparency issue**:
   - LLMs are pre-trained on corpora drawn from multiple sources, but the specific impact of each type of data is still unclear. Because of this opacity, pre-training corpora are still organized by experience and may deviate from the optimal configuration.
2. **Optimizing pre-training data**:
   - Existing ways of organizing pre-training corpora rely mainly on experience and lack systematic analysis and optimization. How to organize pre-training data more efficiently to improve model performance therefore remains an open problem.
3. **Challenges in data impact analysis (DIA)**:
   - Traditional data impact analysis methods, such as retraining-based and gradient-based methods, suffer from high computational cost and unreasonable assumptions when applied to LLMs, which makes it difficult to evaluate the impact of different data sources effectively.

To solve these problems, the authors propose a method based on machine unlearning to quantify the impact of different data sources on LLM performance. By making the model "forget" specific data and comparing its performance before and after forgetting, the role of each kind of pre-training data can be analyzed systematically, providing empirical evidence for optimizing pre-training corpora.

### Main research contents

- **Systematically analyze 48 pre-training datasets from different categories**, covering texts, common-sense knowledge, domain-specific knowledge, mathematics, programming, etc.
- **Measure the impact of these datasets on nine types of model capabilities**, including language modeling, text understanding, reasoning, code generation, etc.
- **Identify "high-impact data" that significantly affects model capabilities**, such as Books, and explore joint impact patterns, including complementary, orthogonal, and correlational relationships.

### Method innovation points

- **Introduce the GRACE algorithm, which combines gradient ascent and retraining**, to keep the forgetting process effective and accurate and to avoid unintended effects on non-target data.
- **Use a randomized-text method to determine the forgetting end point**, so that the model's state after forgetting is close to that of a model that has never seen the target data.

### Conclusions and implications

- The results reveal how different types of data contribute to model capabilities and show that certain datasets (such as Books) have a significant impact on specific tasks.
- The paper offers practical suggestions for optimizing pre-training corpora, such as the proportions of different data types, the arrangement of datasets, and the evaluation of the pre-training process.

Through this work, the authors hope to provide a theoretical basis and technical support for future optimization of pre-training data for LLMs.
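The unlearning-based procedure summarized above can be illustrated with a toy sketch. This is not the authors' GRACE implementation: it assumes a tiny unigram softmax "language model", treats "retraining" as an ordinary gradient-descent step on the retained corpus, and assumes the forgetting end point is reached once the loss on the target corpus is no lower than the loss on uniformly random tokens. All names (`pretrain`, `unlearn`, `target_corpus`, `retain_corpus`, `benchmarks`) are hypothetical and chosen only for illustration.

```python
import math
import random

VOCAB = 6  # toy vocabulary: tokens 0-1 stand in for "code", 2-5 for "text"

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, tokens):
    """Mean negative log-likelihood of tokens under a unigram softmax model."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in tokens) / len(tokens)

def nll_grad(logits, tokens):
    """Gradient of the mean NLL w.r.t. logits: model probs minus token frequencies."""
    probs = softmax(logits)
    freq = [0.0] * len(logits)
    for t in tokens:
        freq[t] += 1.0 / len(tokens)
    return [p - f for p, f in zip(probs, freq)]

def pretrain(tokens, lr=0.5, steps=300):
    """Plain gradient descent on the full corpus, starting from uniform logits."""
    logits = [0.0] * VOCAB
    for _ in range(steps):
        grad = nll_grad(logits, tokens)
        logits = [w - lr * g for w, g in zip(logits, grad)]
    return logits

def unlearn(logits, target, retain, random_text, lr=0.1, max_steps=1000):
    """Gradient ascent on the target corpus plus a descent step on retained data.

    Assumed stopping rule: stop once the target corpus is no more predictable
    than randomized text. Returns the new logits and the number of steps taken.
    """
    for step in range(max_steps):
        if nll(logits, target) >= nll(logits, random_text):
            return logits, step
        g_target = nll_grad(logits, target)
        g_retain = nll_grad(logits, retain)
        logits = [w + lr * gt - lr * gr
                  for w, gt, gr in zip(logits, g_target, g_retain)]
    return logits, max_steps

# Hypothetical toy data: one corpus to forget, one to keep, and two benchmarks.
target_corpus = [0, 1] * 20
retain_corpus = [2, 3, 4, 5] * 15
random_text = random.Random(0).choices(range(VOCAB), k=600)
benchmarks = {"code": [0, 1, 0, 1], "text": [2, 3, 4, 5]}

model = pretrain(target_corpus + retain_corpus)
unlearned, steps_used = unlearn(model, target_corpus, retain_corpus, random_text)

# Impact of the forgotten corpus = change in benchmark loss after unlearning.
impact = {name: nll(unlearned, toks) - nll(model, toks)
          for name, toks in benchmarks.items()}
print(steps_used, impact)
```

In this toy setting, a positive entry in `impact` means the benchmark loss rose after forgetting, which suggests the forgotten corpus was contributing to that capability. A real analysis would replace the unigram model with an actual LLM and the loss differences with benchmark scores.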

Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning
