Will we run out of data? Limits of LLM scaling based on human-generated data

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn
2024-06-05
Abstract: We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Computers and Society
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper asks whether current trends in large language model (LLM) development will exhaust the stock of publicly available human-generated text data. Specifically, the authors investigate the following aspects:

1. **Availability of Data Resources**: As LLMs develop, their demand for training data keeps increasing. The authors estimate the total stock of public human text data currently available and project how that stock will grow over the next few years.
2. **Growth in Data Demand**: The authors project the growth of LLM training dataset sizes from current trends. They find that, if those trends continue, training datasets will roughly match the total stock of available public human text data at some point between 2026 and 2032.
3. **Impact of Data Resource Exhaustion**: The authors discuss how the exhaustion of public human text data could limit the development of language models. They propose several possible responses, including synthetic data generation, transfer learning from data-rich domains, and improvements in data efficiency.

### Main Conclusions

- **Limited Data Resources**: According to the authors' estimates, the stock of public human text data will be fully utilized around 2028 (the abstract gives a range of 2026 to 2032), which poses a challenge to the further development of LLMs.
- **Exploration of Solutions**: To address this challenge, the authors suggest synthetic data generation, transfer learning, and data efficiency improvements as ways to keep advancing language models.

### Relevant Background

- **Data Scarcity Issue**: In recent years, the development of large language models has relied on vast amounts of human-generated text data. The supply of this data is limited, however, and its growth rate may not keep pace with the demands of model development.
- **Data Quality and Multi-Epoch Training**: Besides data quantity, data quality and the effects of multi-epoch training (repeated passes over the same data) also affect model performance. The authors account for these factors in their model to predict the utilization of data resources more accurately.

### Method Overview

- **Data Stock Estimation**: The authors estimate the total amount of public text data on the internet by analyzing publicly available datasets such as Common Crawl.
- **Data Demand Prediction**: By analyzing historical growth trends in training dataset size, the authors project the demand of LLMs for training data over the next few years.
- **Model Adjustment**: The estimates are adjusted for data quality and multi-epoch training so that they reflect actual usage more closely.

### Significance of the Conclusions

This study offers a reference point for understanding a potential bottleneck in the development of large language models. With a forecast of when data resources will be used up, researchers can better plan future research directions, seek new data sources, and develop techniques that support the continued advancement of language models.
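
The forecasting logic summarized under Method Overview (a projected demand curve meeting an estimated stock) can be illustrated with a minimal sketch. It assumes that both dataset size and data stock grow by a constant factor per year, which is a simplification and not necessarily the authors' model. The function name `years_until_exhaustion`, its parameters, and the input values are hypothetical placeholders chosen for easy arithmetic; they are not estimates from the paper.

```python
import math


def years_until_exhaustion(dataset_size, dataset_growth, stock_size, stock_growth=1.0):
    """Years until a growing training dataset matches the available data stock.

    Assumes (hypothetically) constant yearly growth factors and solves
        dataset_size * dataset_growth**t == stock_size * stock_growth**t
    for t. Returns None if demand never catches up with the stock.
    """
    if dataset_size >= stock_size:
        return 0.0  # the stock is already fully used
    if dataset_growth <= stock_growth:
        return None  # demand grows no faster than the stock
    return math.log(stock_size / dataset_size) / math.log(dataset_growth / stock_growth)


# Placeholder inputs in arbitrary units, NOT figures from the paper:
# a dataset that doubles yearly against a fixed stock eight times its size.
print(years_until_exhaustion(dataset_size=1.0, dataset_growth=2.0, stock_size=8.0))
```

With these placeholder inputs the dataset doubles three times before it reaches the stock, so the crossing comes after about three years. Letting the stock grow as well (`stock_growth` above 1.0) pushes the crossing later, while a larger starting dataset, as in the overtraining case the abstract mentions, pulls it earlier.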