Abstract: Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks and domains, with data preparation playing a critical role in achieving these results. Pre-training data typically combines information from multiple domains. To maximize performance when integrating data from various domains, determining the optimal data proportion is essential. However, state-of-the-art (SOTA) LLMs rarely disclose details about their pre-training data, making it difficult for researchers to identify ideal data proportions. In this paper, we introduce a new topic, *data proportion detection*, which enables the automatic estimation of pre-training data proportions by analyzing the generated outputs of LLMs. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection. Based on these findings, we offer valuable insights into the challenges and future directions for effective data proportion detection and data management.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the management of pre-training data for large language models (LLMs), specifically how to determine the optimal data proportion. It introduces a new concept, **Data Proportion Detection**, which automatically estimates the proportions of a model's pre-training data by analyzing the outputs the LLM generates.

### Background and Motivation

1. **Ineffective Training Results**: Without an optimal data proportion, pre-training models struggle to learn effectively, leading to poor performance in downstream tasks due to insufficient learning of key features.
2. **Wasted Computational Resources**: The lack of an optimal data proportion forces models to use excessive computational resources, prolonging training time and increasing overall costs.
3. **Wasted Data and Management Costs**: Suboptimal data proportions reduce the value extracted from available data while increasing data management costs, as both overuse and underuse of data lead to unnecessary expenses.

### Solution

To address these challenges, the paper introduces the concept of data proportion detection and makes the following contributions:

1. **New Perspective**: Proposes a new method to detect data proportions in closed-source pre-trained models, which helps manage pre-training data proportions effectively and reduces the cost of adjusting data proportions during pre-training.
2. **New Topic**: Based on a rigorous theory of data distribution, proposes a data proportion detection algorithm and conducts preliminary data proportion detection experiments, establishing a baseline for future research.
3. **New Challenges in Data Management**: Outlines four key challenges in data management: rapid large-scale LLM inference systems, robust data cleaning and classification systems, next-generation data mixing laws, and the robustness of data preparation systems.

### Methods and Experiments

1. **Problem Definition**: Defines the data proportion detection problem, where the input is a trained LLM and the output is the estimated data proportions across different domains.
2. **Theoretical Proof**: Derives a formula for estimating pre-training data proportions from the relationship between data proportions and the loss function.
3. **Practical Algorithm**: Proposes a practical algorithm with three steps: data generation, classification, and data proportion estimation.
4. **Preliminary Experiments**: Uses the MAP-NEO 7B Base model to generate 100,000 data points and estimates the data proportions across different domains using a classification model. The preliminary results indicate that the method is effective and also expose the challenges it faces.

### Challenges and Future Directions

1. **Rapid Large-Scale LLM Inference Systems**: Rapid large-scale LLM inference frameworks are needed to produce and process large amounts of generated data.
2. **Robust Data Cleaning and Classification Systems**: Specialized data cleaning techniques are needed to handle generated data, along with more reliable data classification systems.
3. **Next-Generation Data Mixing Laws**: More precise data mixing laws are needed to describe the impact of data proportions on model performance.
4. **Robust Data Preparation Systems**: Robust data management systems are needed to handle over 30TB of pre-training data.

### Conclusion

The data proportion detection method proposed in the paper provides a new perspective and a practical tool for optimizing LLM pre-training data management. Although preliminary experiments demonstrate the method's effectiveness, several challenges remain. Future research will focus on addressing these issues to further enhance LLM performance and data management efficiency.
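
The three-step procedure summarized under Methods and Experiments (generate data, classify it, estimate proportions) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the domain labels `code`, `math`, and `web`, and the keyword-based `toy_classifier` are all hypothetical, and the hard-coded strings stand in for text that would actually be sampled from the LLM under study. The estimate here is simply the frequency of each predicted domain label among the generated samples.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Sequence


def estimate_proportions(
    samples: Iterable[str],
    classify: Callable[[str], str],
    domains: Sequence[str],
) -> Dict[str, float]:
    """Estimate domain proportions as label frequencies over generated samples.

    Samples whose predicted label is not in `domains` are ignored.
    """
    counts = Counter(classify(text) for text in samples)
    total = sum(counts[d] for d in domains)
    if total == 0:
        raise ValueError("no sample was assigned to a known domain")
    return {d: counts[d] / total for d in domains}


def toy_classifier(text: str) -> str:
    """Hypothetical keyword classifier standing in for a trained domain classifier."""
    lowered = text.lower()
    if "def " in lowered or "import " in lowered:
        return "code"
    if "theorem" in lowered or "equation" in lowered:
        return "math"
    return "web"


# Step 1 (assumed): placeholder strings standing in for LLM-generated text.
generated = [
    "import os\nprint(os.getcwd())",
    "The theorem follows from the previous equation.",
    "Ten tips for a better garden this spring.",
    "Local news: the bridge reopens on Monday.",
]

# Steps 2 and 3: classify each sample, then normalize the label counts.
proportions = estimate_proportions(generated, toy_classifier, ["code", "math", "web"])
print(proportions)
```

In practice the quality of such an estimate depends on the classifier and on how clean the generated text is, which is consistent with the summary's call for robust data cleaning and classification systems.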

Data Proportion Detection for Optimized Data Management for Large Language Models
