Training Data for Large Language Model

Yiming Ju, Huanhuan Ma
2024-11-12
Abstract: In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in terms of parameters and the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich and high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering data scale, collection methods, data types and characteristics, and processing workflows, and provides an overview of available open-source datasets.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the current state of training data for large-scale language models and the directions in which that data can be improved. Specifically, it covers the following aspects:

1. **Data Scale**: Examines how much pretraining and fine-tuning data large-scale language models require, and emphasizes the importance of large datasets for improving model performance.
2. **Data Sources**: Summarizes the main types of data sources, including web pages, books, academic materials, code, social media, and encyclopedias, and analyzes the characteristics and advantages of each.
3. **Data Processing**: Details the processing workflow for pretraining data, including deduplication, filtering, and cleaning, and emphasizes the importance of high-quality data for model training.
4. **Current State of Datasets**: Surveys the open-source datasets currently available, both domestic and international, and analyzes their characteristics and application scenarios.
5. **Future Development Directions**: Discusses key considerations for future dataset construction, including diversity of data distribution, data quality and interpretability, and data security and privacy protection.

In summary, the paper aims to provide guidance for constructing and optimizing datasets by comprehensively surveying the current state of training data for large-scale language models.
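
The processing steps named in item 3 (deduplication, filtering, cleaning) can be illustrated with a minimal Python sketch. This is not the paper's pipeline: the function names `clean` and `preprocess`, the `min_words` threshold, and the choice of exact-match deduplication on a SHA-256 hash are all assumptions made for illustration. Production pipelines typically add near-duplicate detection and much richer quality filters.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Cleaning step: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(docs, min_words=5):
    """Clean, filter, and exactly deduplicate an iterable of raw documents.

    `min_words` is a hypothetical quality threshold chosen for illustration.
    """
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        # Filtering step: drop documents that are too short to be useful.
        if len(text.split()) < min_words:
            continue
        # Deduplication step: exact match on the lowercased, cleaned text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "The quick  brown fox jumps over the lazy dog.",
        "the quick brown fox jumps over the lazy dog.\n",
        "too short",
        "A second, distinct document with enough words.",
    ]
    for doc in preprocess(sample):
        print(doc)
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total length, which is one reason exact deduplication is often run as a cheap first pass before any fuzzy matching.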