Training Data for Large Language Model

Yiming Ju, Huanhuan Ma

2024-11-12

Abstract: In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in parameter count and in the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich, high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering data scale, collection methods, data types and characteristics, and processing workflows, and provides an overview of available open-source datasets.

Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the current state of training data for large-scale language models and the directions in which that data can be improved. Specifically, it focuses on the following aspects:

1. **Data Scale**: Examines the scale of pre-training and fine-tuning data required for large-scale language models, and emphasizes the importance of large-scale datasets for improving model performance.
2. **Data Sources**: Summarizes the different types of data sources, including web pages, books, academic materials, code, social media, and encyclopedias, and analyzes the characteristics and advantages of each.
3. **Data Processing**: Details the processing workflow for pre-training data, including steps such as deduplication, filtering, and cleaning, and emphasizes the importance of high-quality data for model training.
4. **Current State of Datasets**: Surveys the currently available domestic and international open-source datasets, and analyzes their characteristics and application scenarios.
5. **Future Development Directions**: Discusses the key points of future dataset construction, including diversity of data distribution, data quality and interpretability, and data security and privacy protection.

In summary, the paper aims to provide guidance and suggestions for dataset construction and optimization by comprehensively summarizing the current state of training data for large-scale language models.
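As a rough illustration of the data-processing steps named above (cleaning, filtering, and deduplication), the following is a minimal Python sketch. It is not the paper's pipeline: the function names `clean`, `keep`, and `process`, the tag-stripping regex, and the five-word length threshold are all hypothetical choices made for this example, and deduplication here is exact-match only.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Strip HTML-like tags and collapse runs of whitespace (illustrative rule)."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Toy quality filter: drop very short documents (threshold is an assumption)."""
    return len(text.split()) >= min_words


def process(docs):
    """Clean, filter, then exact-deduplicate by hashing the lowercased text."""
    seen = set()
    out = []
    for raw in docs:
        text = clean(raw)
        if not keep(text):
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical document (ignoring case) was already kept
        seen.add(digest)
        out.append(text)
    return out


docs = [
    "<p>Large language models need  clean data.</p>",
    "large language models need clean data.",
    "too short",
    "Books and code are common pretraining sources.",
]
result = process(docs)
```

In this example the second document is dropped as a case-insensitive duplicate of the first, and the third is dropped by the length filter. Real pretraining pipelines generally go further, for instance with near-duplicate detection and learned quality classifiers; this sketch only shows the order of the steps.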

Taking ChatGPT as an example to analyze the main technologies used in large language models

Large Language Models as Data Preprocessors

A Survey of Large Language Models

Improving Text Classification with Large Language Model-Based Data Augmentation

Large Language Models: A Survey

Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies

Datasets for Large Language Models: A Comprehensive Survey

Data Management For Training Large Language Models: A Survey

ChatGPT, an Opportunity to Understand More About Language Models

A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot

Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models

Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models

Analysis of the Technical Principles of ChatGPT and Prospects for Pre-trained Large Models

Distributed Training of Large Language Models

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning

Data, Data Everywhere: A Guide for Pretraining Dataset Construction

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning

A Survey on Data Augmentation in Large Model Era