Abstract: We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper asks whether the current trajectory of large language model (LLM) development will exhaust the stock of publicly available human-generated text. Specifically, the authors investigate the following aspects:

1. **Availability of Data Resources**: The authors estimate the total stock of public human text data currently available and project how that stock will grow over the next few years.
2. **Growth in Data Demand**: The authors extrapolate the growth in LLM training dataset sizes from current trends. They find that, if the trend continues, training datasets will be roughly equal in size to the available stock of public human text data between 2026 and 2032.
3. **Impact of Data Resource Exhaustion**: The authors discuss how language model development could be constrained once public human text data is exhausted. They propose several possible responses: synthetic data generation, transfer learning from data-rich domains, and improvements in data efficiency.

### Main Conclusions

- **Limited Data Resources**: According to the authors' estimates, public human text data will be fully utilized around 2028 (within the 2026 to 2032 range given in the abstract), which poses a challenge to further LLM scaling.
- **Exploration of Solutions**: To address this challenge, the authors suggest synthetic data generation, transfer learning, and data efficiency improvements as ways to keep advancing language models.

### Relevant Background

- **Data Scarcity Issue**: Recent progress in large language models has relied on vast amounts of human-generated text. The supply of this data is finite, and its growth rate may not keep pace with the demands of model development.
- **Data Quality and Multi-Epoch Training**: Besides the quantity of data, data quality and the effect of training for multiple epochs (repeated passes over the same data) also affect model performance. The authors account for these factors in their projection so that it predicts data utilization more accurately.

### Method Overview

- **Data Stock Estimation**: The authors estimate the total amount of public text data on the internet by analyzing publicly available datasets such as Common Crawl.
- **Data Demand Prediction**: The authors extrapolate historical growth in training dataset sizes to predict LLM demand for training data over the next few years.
- **Model Adjustment**: The projection is adjusted for data quality and multi-epoch training so that it better reflects how data is used in practice.

### Significance of the Conclusions

This study identifies a potential bottleneck in the development of large language models. With a forecast of when the data stock will be fully utilized, researchers can plan research directions, seek new data sources, and develop techniques that allow language models to keep improving.
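
The projection described in the Method Overview can be illustrated with a small calculation: if training dataset size and the data stock both grow exponentially, the year in which they meet follows from solving `dataset * g^t = stock * s^t` for `t`. The sketch below is a simplified illustration, not the authors' model. The function name `crossover_year`, its parameters, and every number in the usage example are hypothetical and are not estimates taken from the paper.

```python
import math


def crossover_year(start_year, dataset_tokens, dataset_growth,
                   stock_tokens, stock_growth=1.0, overtrain_factor=1.0):
    """Year in which training dataset size equals the data stock.

    Simplifying assumption (not from the paper): both quantities grow by a
    constant multiplicative factor per year. `overtrain_factor` scales the
    dataset size to mimic overtrained models, which use more tokens.
    Returns None if demand never catches up with the stock.
    """
    demand = dataset_tokens * overtrain_factor
    if demand >= stock_tokens:
        return float(start_year)  # stock is already fully used
    if dataset_growth <= stock_growth:
        return None  # demand grows no faster than the stock
    # Solve demand * g^t = stock * s^t  =>  t = ln(stock/demand) / ln(g/s)
    years = math.log(stock_tokens / demand) / math.log(dataset_growth / stock_growth)
    return start_year + years


if __name__ == "__main__":
    # Purely illustrative inputs, chosen for readability only.
    base = crossover_year(2024, 1e13, 2.5, 3e14, stock_growth=1.05)
    over = crossover_year(2024, 1e13, 2.5, 3e14, stock_growth=1.05,
                          overtrain_factor=5.0)
    print(f"baseline crossover:    {base:.1f}")
    print(f"overtrained crossover: {over:.1f}")
```

Under these assumptions a larger `overtrain_factor` always moves the crossover earlier, which mirrors the abstract's remark that the stock is reached "slightly earlier if models are overtrained". The paper's own forecast additionally adjusts for data quality and multi-epoch training, which this sketch does not attempt.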

Will we run out of data? Limits of LLM scaling based on human-generated data
