Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning

Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Zhouhao Sun, Jun Shi, Ting Liu, Bing Qin
2024-08-28
Abstract: Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the pretraining corpus is still empirical and may deviate from the optimal. To address this issue, we systematically analyze the impact of 48 datasets from 5 major categories of pretraining data of LLMs and measure their impacts on LLMs using benchmarks about nine major categories of model capabilities. Our analyses provide empirical results about the contribution of multiple corpora on the performances of LLMs, along with their joint impact patterns, including complementary, orthogonal, and correlational relationships. We also identify a set of "high-impact data" such as Books that is significantly related to a set of model capabilities. These findings provide insights into the organization of data to support more efficient pretraining of LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper sets out to **quantify and understand how data from different sources and of different types in the pretraining corpora of large language models (LLMs) affects model performance**. Specifically, it addresses the following problems:

1. **Transparency**: LLMs are pretrained on corpora drawn from multiple sources, but the specific impact of each type of data is still unclear. Because of this opacity, the organization of pretraining corpora still relies on experience and may deviate from the optimal configuration.
2. **Optimizing pretraining data**: Existing ways of organizing pretraining corpora are mainly experience-based and lack systematic analysis and optimization. How to organize pretraining data more efficiently to improve model performance therefore remains an open problem.
3. **Challenges in data impact analysis (DIA)**: Traditional data impact analysis methods, such as retraining-based and gradient-based methods, suffer from high computational cost and unreasonable assumptions when applied to LLMs, which makes it difficult to evaluate the impact of different data sources effectively.

To address these problems, the authors propose a method based on machine unlearning to quantify the impact of different data sources on LLM performance. By making the model "forget" specific data and comparing its performance before and after forgetting, the role of each kind of pretraining data can be analyzed systematically, which provides empirical evidence for optimizing pretraining corpora.

### Main research contents

- **Systematically analyze 48 pretraining datasets from different categories**, covering text, commonsense knowledge, domain-specific knowledge, mathematics, programming, etc.
- **Measure the impact of these datasets on nine categories of model capabilities**, including language modeling, text understanding, reasoning, code generation, etc.
- **Identify "high-impact data" that significantly affects model capabilities**, such as Books, and explore joint impact patterns, including complementary, orthogonal, and correlational relationships.

### Method innovation points

- **Introduce the GRACE algorithm, which combines gradient ascent with retraining**, to keep the forgetting process effective and accurate and to avoid unintended effects on non-target data.
- **Use randomized text to determine the endpoint of forgetting**, so that the state of the model after forgetting is close to that of a model that has never seen the target data.

### Conclusions and implications

- The results reveal how different types of data contribute to model capabilities and show that certain datasets (such as Books) have a significant impact on specific tasks.
- The paper offers practical suggestions for optimizing pretraining corpora, such as how to set the proportions of different types of data, how to arrange datasets, and how to evaluate the pretraining process.

Through these studies, the authors hope to provide a theoretical basis and technical support for future optimization of pretraining data for LLMs.
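
The unlearning idea summarized in the method points above (gradient ascent on the data to forget, interleaved with retraining on the remaining data, stopped by a randomized-text criterion, followed by a before/after comparison) can be illustrated on a toy model. The sketch below is not the GRACE implementation from the paper. It assumes a unigram softmax "language model", takes "randomized text" to mean tokens drawn uniformly from the vocabulary, and stops once the loss on the target data reaches the loss on that random text. All function names, datasets, and hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits, tokens):
    """Mean negative log-likelihood of tokens under a unigram softmax model."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in tokens) / len(tokens)


def gradient_step(logits, tokens, lr, ascent=False):
    """One full-batch gradient step on the mean NLL; ascent=True raises the loss."""
    probs = softmax(logits)
    freq = [0.0] * len(logits)
    for t in tokens:
        freq[t] += 1.0 / len(tokens)
    sign = 1.0 if ascent else -1.0
    # d(NLL)/d(logit_i) = probs[i] - freq[i]
    return [w + sign * lr * (p - f) for w, p, f in zip(logits, probs, freq)]


def pretrain(vocab_size, corpus, lr=0.5, steps=500):
    """Fit the toy model on the full corpus with plain gradient descent."""
    logits = [0.0] * vocab_size
    for _ in range(steps):
        logits = gradient_step(logits, corpus, lr)
    return logits


def unlearn(logits, target, retained, random_text, lr=0.1, max_steps=1000):
    """Alternate ascent on `target` with descent on `retained`.

    Assumed stopping rule: halt once the loss on the target data is at least
    the loss on randomized text. Returns the new logits and the step count.
    """
    for n in range(max_steps):
        if nll(logits, target) >= nll(logits, random_text):
            return logits, n
        logits = gradient_step(logits, target, lr, ascent=True)  # forget
        logits = gradient_step(logits, retained, lr)              # retrain
    return logits, max_steps


def demo():
    vocab_size = 6
    target = [0, 1] * 100         # toy stand-in for the dataset to forget
    retained = [2, 3, 4, 5] * 25  # toy stand-in for the rest of the corpus
    rng = random.Random(0)
    random_text = [rng.randrange(vocab_size) for _ in range(600)]

    base = pretrain(vocab_size, target + retained)
    unlearned, steps = unlearn(base, target, retained, random_text)
    # Impact = change in loss after forgetting; the training sets double as
    # stand-ins for capability benchmarks in this toy setting.
    impact = {
        "target_domain": nll(unlearned, target) - nll(base, target),
        "retained_domain": nll(unlearned, retained) - nll(base, retained),
    }
    return steps, impact


if __name__ == "__main__":
    print(demo())
```

In this toy setting, a positive loss change on the target domain together with a non-positive change on the retained domain is the kind of before/after contrast that the summary describes as the measured impact of a dataset. A real study would replace the loss deltas with benchmark scores on held-out tasks.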