A Survey on Data Synthesis and Augmentation for Large Language Models

Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang
2024-10-17
Abstract: The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the scarcity of high-quality data in the development of large language models (LLMs). As models grow in scale, the demand for large-scale, diverse, high-quality training data keeps rising. However, the growth rate of high-quality data lags far behind the expansion rate of training datasets, which may lead to a data-depletion crisis. The paper therefore emphasizes the urgency of improving data efficiency and exploring new data sources, particularly synthetic data. It comprehensively reviews and summarizes data-generation techniques throughout the LLM lifecycle, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. It also discusses the current limitations of these methods and explores potential paths for future development. The aim is to give researchers a clear understanding of these methodologies so that they can quickly identify suitable data-generation strategies when constructing LLMs, and to provide valuable insights for future exploration.